
High-Dimensional Regression

Like most statistical smoothing approaches, kernel-based methods suffer from the so-called ``curse of dimensionality'' when applied to multivariate data: the proportion of the training data that lies in a fixed-radius neighborhood of a point decreases to zero at an exponential rate as the dimension of the input space grows. As a consequence, the bandwidth of a weighting kernel must be chosen very large to capture a reasonable fraction of the sample, and the resulting estimates are typically highly biased. One possible way to reduce the bias of local linear estimates is to estimate the ``shape'' of the weighting kernel from the training data.

  
Figure 1: Left: Example of a single-index model of the form $y = g(x'\kappa)$ with $\kappa = (1,1)$ and $g(z) = \tanh(3z)$. Right: The contours of the function are straight lines orthogonal to $\kappa$; the contours of an optimized weighting kernel are stretched in the orthogonal direction.

[Figures: singidx.eps (left), singctur.eps (right)]

This idea is illustrated in Figure 1. The function on the left is sigmoidal along an index vector $\kappa$ and constant in directions orthogonal to $\kappa$. A ``shaped'' weighting kernel is therefore shrunk in the direction of $\kappa$ and stretched in the orthogonal direction (right), minimizing the kernel's exposure to the nonlinear variation.

To express this idea formally, we consider local linear estimates of the form

 \begin{displaymath}\hat f(x_0,\Omega) = x'({X^0}'W^0 X^0)^{-1} {X^0}'W^0 Y,
\end{displaymath} (2)

where $X^0$ is the $T\times (d+1)$ design matrix with rows $(1,\, x_t'-x_0')$, Y is the vector of response values, and $W^0$ is a diagonal $T\times T$ weighting matrix with entries $W^0_{t,t} = k(x_t,x_0)$. Here $k(x_t,x_0)$ is a non-negative weighting kernel - for example, a Gaussian - based on the weighted Euclidean distance

 \begin{displaymath}\vert\vert x_t-x_0\vert\vert _\Omega \equiv \sqrt{(x_t - x_0)'\Omega (x_t-x_0)}.
\end{displaymath} (3)
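As a concrete illustration, the following NumPy sketch evaluates the local linear estimate (2) at a query point $x_0$, using a Gaussian kernel of the weighted distance (3). The specific kernel form $k(x_t,x_0)=\exp(-\frac{1}{2}\Vert x_t-x_0\Vert_\Omega^2)$ and all function names are illustrative choices made here, not taken from the text.

\begin{verbatim}
import numpy as np

def gaussian_kernel(X, x0, Omega):
    """Weights k(x_t, x_0) = exp(-0.5 * ||x_t - x_0||_Omega^2) for all rows of X."""
    diff = X - x0                                        # rows x_t - x_0
    d2 = np.einsum('ti,ij,tj->t', diff, Omega, diff)     # squared weighted distances (3)
    return np.exp(-0.5 * d2)

def local_linear_fit(X, y, x0, Omega):
    """Local linear estimate \hat f(x_0, Omega) of equation (2)."""
    T = X.shape[0]
    X0 = np.hstack([np.ones((T, 1)), X - x0])   # design matrix with rows (1, x_t' - x_0')
    w = gaussian_kernel(X, x0, Omega)           # diagonal of W^0
    XtW = X0.T * w                              # X0' W^0 without forming the T x T matrix
    beta = np.linalg.solve(XtW @ X0, XtW @ y)   # weighted least-squares coefficients
    return beta[0]                              # fitted value at x_0 (intercept term)
\end{verbatim}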

Estimating the optimal kernel shape amounts to estimating the optimal $\Omega$ in (2). For this purpose, it is advantageous to distinguish between the bandwidth and the shape of $\Omega$. Formally, we decompose

 \begin{displaymath}\Omega \equiv \lambda \cdot (L L'+ I).
\end{displaymath} (4)

Here the bandwidth parameter $\lambda$ governs the size of the neighborhood used for averaging, and the shape matrix L describes deviations from the Euclidean metric. In particular, L may be of reduced rank, and $\Omega$ remains positive definite as long as $\lambda > 0$.
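A minimal sketch of the decomposition (4) is given below; the construction is generic NumPy, and the rank-one example is hypothetical. It checks that $\Omega$ stays positive definite even when L has reduced rank, provided $\lambda > 0$.

\begin{verbatim}
import numpy as np

def make_omega(lam, L):
    """Omega = lambda * (L L' + I) as in equation (4)."""
    return lam * (L @ L.T + np.eye(L.shape[0]))

# Hypothetical example: a rank-one shape matrix in five dimensions.
L = np.zeros((5, 1))
L[0, 0] = 3.0
Omega = make_omega(lam=0.5, L=L)
assert np.all(np.linalg.eigvalsh(Omega) > 0)    # positive definite for lambda > 0
\end{verbatim}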

The task is to choose $\lambda$ and L so as to achieve high predictive performance. To specify $\lambda$, we define the ``entropy'' of the weighting kernel $k(x_t,x_0)$ as a function of $\Omega$:

 \begin{displaymath}H(\Omega) \equiv - \sum_{t=1}^T k(x_t,x_0) \log k(x_t,x_0).
\end{displaymath} (5)

Then $\lambda$ can be defined as an implicit function of L and the data via the constraint $H(\Omega) = \log k$ for some constant k. Intuitively, this constraint fixes the effective size of the neighborhood and thereby prevents overfitting caused by degenerate choices of $\Omega$. The optimal shape matrix L can then be estimated from the training data by minimizing the constrained mean squared prediction error:
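The entropy (5) and the implicit choice of $\lambda$ could be sketched as follows, reusing gaussian_kernel and make_omega from the sketches above. The text treats $\lambda$ as the solution of $H(\Omega) = \log k$; as a crude stand-in for solving that implicit equation, the sketch simply scans a grid of candidate bandwidths, which is an assumption of this illustration rather than the author's procedure.

\begin{verbatim}
import numpy as np

def kernel_entropy(X, x0, Omega):
    """Entropy H(Omega) of equation (5) at the query point x0."""
    w = gaussian_kernel(X, x0, Omega)         # weights k(x_t, x_0)
    w = np.clip(w, 1e-300, None)              # guard against log(0)
    return -np.sum(w * np.log(w))

def solve_bandwidth(X, x0, L, target_log_k, lams=np.logspace(-4, 3, 200)):
    """Pick lambda so that H(lambda * (L L' + I)) is closest to the target log k."""
    H = np.array([kernel_entropy(X, x0, make_omega(lam, L)) for lam in lams])
    return lams[np.argmin(np.abs(H - target_log_k))]
\end{verbatim}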

 \begin{displaymath}{\cal E}(L;D) \equiv \sum_{t=1}^T (y_t - \hat f(x_t;\Omega))^2.
\end{displaymath} (6)
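Building on the sketches above, the criterion (6) could be evaluated as below; note that $\lambda$ is re-solved from the entropy constraint at every query point, so the error is a function of the shape matrix L alone.

\begin{verbatim}
def shaping_error(L, X, y, target_log_k):
    """Constrained mean squared prediction error E(L; D) of equation (6)."""
    err = 0.0
    for t in range(X.shape[0]):
        lam = solve_bandwidth(X, X[t], L, target_log_k)           # implicit bandwidth at x_t
        f_hat = local_linear_fit(X, y, X[t], make_omega(lam, L))  # local fit at x_t
        err += (y[t] - f_hat) ** 2
    return err
\end{verbatim}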

A key issue for the practical realization of this approach is the numerical minimization of (6). We suggest using a gradient-descent algorithm, where the main difficulty lies in evaluating the error gradient $\partial {\cal E}/\partial L$. We use implicit differentiation together with the chain rule to compute this quantity. Note that the bandwidth parameter $\lambda$ must be updated after each gradient-descent step, which can be computationally expensive. We developed a special algorithm to carry out this computation efficiently.
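The exact gradient $\partial{\cal E}/\partial L$ via implicit differentiation and the efficient bandwidth update are not reproduced here. As a rough stand-in, the sketch below uses a finite-difference gradient, which makes the descent loop explicit (including the re-solution of $\lambda$ inside every error evaluation) at the cost of many more function evaluations; the learning rate, rank, and initialization are illustrative assumptions.

\begin{verbatim}
def fit_shape_matrix(X, y, rank, target_log_k, steps=50, lr=1e-2, eps=1e-4, seed=0):
    """Gradient descent on E(L; D) with a finite-difference gradient (a stand-in
    for the implicit-differentiation gradient described in the text)."""
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    L = 0.01 * rng.standard_normal((d, rank))   # small random start: L = 0 is a
                                                # stationary point since E depends on L L'
    for _ in range(steps):
        base = shaping_error(L, X, y, target_log_k)
        grad = np.zeros_like(L)
        for i in range(d):                      # forward finite differences
            for j in range(rank):
                Lp = L.copy()
                Lp[i, j] += eps
                grad[i, j] = (shaping_error(Lp, X, y, target_log_k) - base) / eps
        L -= lr * grad                          # gradient-descent step on the shape matrix
    return L
\end{verbatim}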

We applied kernel shaping to a five-dimensional and a ten-dimensional toy example as well as to a real-world data set. On the two artificial data sets, kernel shaping clearly outperformed local linear regression with a spherical kernel; on the real-world example, it led to a small performance improvement.


Dirk Ormoneit
1999-10-17