
Dirk Ormoneit
Trevor Hastie
Volker Tresp

Data Mining

High-Dimensional Regression

Statistical learning suffers from the so-called "curse of dimensionality" when applied to high-dimensional data: the proportion of the training data that lie in a fixed-radius neighborhood of a point decreases to zero at an exponential rate as the dimension of the input space grows. While it is impossible to "break" the curse of dimensionality in general, we can frequently reduce the effective number of dimensions by finding a suitable projection of the original data into a simplified domain. In this work, we design a new algorithm that determines this projection automatically using "kernel shaping". Specifically, we consider local linear estimates and estimate the optimal "shape" of the weighting kernel from the training data. The idea is illustrated in Figures 1 and 2: the function on the left is sigmoidal along an index vector $\kappa$ and constant in directions orthogonal to $\kappa$. A "shaped" weighting kernel is therefore shrunk in the direction of $\kappa$ and stretched in the orthogonal direction (right), which minimizes the exposure of the kernel to the nonlinear variation.

Figure 1: Example of a single index model of the form $y = g(x'\kappa)$ with $\kappa = (1,1)$ and $g(z) = \tanh(3z)$.

[Figure 1: singidx.eps]

Figure 2: The contours of $g(x'\kappa)$ are straight lines orthogonal to $\kappa$. The contours of an optimized weighting kernel are stretched in the orthogonal direction.
[Figure 2: singctur.eps]
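To make the idea concrete, here is a minimal numerical sketch (not the original implementation): it simulates data from the single index model of Figure 1 and compares a local linear estimate that uses a spherical Gaussian weighting kernel with one whose kernel metric is shrunk along $\kappa$ by hand. The metric matrix M, the bandwidth h, and the shrinkage factor are illustrative assumptions; the actual algorithm estimates the kernel shape from the training data.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Data from the single index model of Figure 1:
# y = g(x'kappa) with kappa = (1, 1) and g(z) = tanh(3z), plus noise.
kappa = np.array([1.0, 1.0])
g = lambda z: np.tanh(3.0 * z)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = g(X @ kappa) + 0.05 * rng.standard_normal(500)

def local_linear_fit(x0, X, y, M, h=0.4):
    """Local linear estimate at x0 with Gaussian weights exp(-d'Md / (2 h^2))."""
    d = X - x0
    w = np.exp(-0.5 * np.einsum('ij,jk,ik->i', d, M, d) / h ** 2)
    sw = np.sqrt(w)
    A = np.hstack([np.ones((len(X), 1)), d])           # intercept + linear term
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta[0]                                      # fitted value at x0

x0 = np.array([0.2, -0.1])
u = kappa / np.linalg.norm(kappa)
M_spherical = np.eye(2)
# Hand-shaped kernel for illustration: shrink along kappa, stretch orthogonally.
# (Kernel shaping estimates this matrix from the data instead.)
M_shaped = np.eye(2) + 20.0 * np.outer(u, u)

print("true value      :", g(x0 @ kappa))
print("spherical kernel:", local_linear_fit(x0, X, y, M_spherical))
print("shaped kernel   :", local_linear_fit(x0, X, y, M_shaped))
\end{verbatim}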

We applied kernel shaping to a five-dimensional and a ten-dimensional toy example as well as to a real-world data set. On the two artificial data sets, kernel shaping clearly outperformed local linear regression with a spherical kernel; on the real-world example it yielded a small performance improvement.

Gaussian Mixture Density Estimation

In another project, we consider Gaussian mixture density estimates of high-dimensional data sets. It is well known that unregularized density estimation using Gaussian mixtures can lead to overfitting: in the extreme case, the log-likelihood grows without bound if the center of one Gaussian coincides with one of the data points and its covariance matrix approaches the zero matrix (see Figure 3, left). We compare three alternatives to deal with this situation:

  1. Averaging
    Averaging can improve the performance of a predictor if the individual models are sufficiently diverse. We compare several resampling schemes for increasing diversity, including bagging.
  2. Maximum Penalized Likelihood
    An alternative approach is to add a penalty term to the log-likelihood function as a regularizer. The maximum penalized likelihood estimate is equivalent to the maximum a posteriori (MAP) parameter estimate in a Bayesian approach if we interpret the penalty as the negative logarithm of the prior distribution. In particular, if we choose the negative logarithm of a conjugate prior as the penalty function, we can derive EM update rules for the optimal parameter estimates (see the sketch after this list). Regularized density estimates for several hyper-parameters are shown in Figure 3 (middle and right).
  3. Bayesian Sampling
    In a fully Bayesian approach we compute the predictive distribution by integrating with respect to the posterior distribution. We use a Markov Chain Monte Carlo (MCMC) method for this purpose; in the Gaussian mixture case, parameter values can be sampled hierarchically using "data augmentation".
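As an illustration of the second alternative, the following sketch implements one EM iteration that maximizes a penalized log-likelihood for a Gaussian mixture. The penalty is the negative logarithm of an inverse-Wishart-style conjugate prior on each covariance matrix, so every covariance update is shrunk towards a prior matrix S0 and cannot collapse to the zero matrix. The prior matrix S0, the prior strength nu, and the exact normalization of the update are illustrative assumptions, not the hyper-parameters or the precise update rules of the original work.

\begin{verbatim}
import numpy as np

def map_em_step(X, pi, mu, Sigma, S0, nu):
    """One EM iteration for a penalized (MAP-style) Gaussian mixture fit."""
    n, d = X.shape
    K = len(pi)

    # E-step: responsibilities r[i, k] proportional to pi_k * N(x_i | mu_k, Sigma_k).
    logr = np.zeros((n, K))
    for k in range(K):
        diff = X - mu[k]
        _, logdet = np.linalg.slogdet(Sigma[k])
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma[k]), diff)
        logr[:, k] = np.log(pi[k]) - 0.5 * (logdet + maha + d * np.log(2 * np.pi))
    logr -= logr.max(axis=1, keepdims=True)
    r = np.exp(logr)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: the conjugate prior adds S0 to the weighted scatter matrix and
    # nu to the effective count, keeping Sigma_k away from the zero matrix
    # even if a component concentrates on a single data point.
    Nk = r.sum(axis=0) + 1e-12          # small floor avoids division by zero
    pi = Nk / n
    for k in range(K):
        mu[k] = (r[:, k] @ X) / Nk[k]
        diff = X - mu[k]
        scatter = (r[:, k, None] * diff).T @ diff
        Sigma[k] = (S0 + scatter) / (nu + Nk[k])    # regularized covariance update
    return pi, mu, Sigma

# Illustrative usage on random data:
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
K, d = 5, 2
pi = np.full(K, 1.0 / K)
mu = X[rng.choice(len(X), K, replace=False)].copy()
Sigma = np.stack([np.eye(d)] * K)
for _ in range(20):
    pi, mu, Sigma = map_em_step(X, pi, mu, Sigma, S0=0.1 * np.eye(d), nu=2.0)
\end{verbatim}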

Figure 3: Unregularized density estimate (left) and regularized density estimates (middle and right) based on a Gaussian mixture with 40 components.
[Figures: cir_g0_01.eps, cir_g0_05.eps, cir_g0_1.eps]

Experiments using artificially generated and real-world medical data show that averaging and maximum penalized likelihood always performed better than the maximum likelihood approach. The Bayesian approach gave good performance on a low-dimensional toy data set but failed on two higher-dimensional problems with ten and six dimensions, respectively. Averaging clearly outperformed maximum penalized likelihood on the medical data set.

Related Publications



