datamin

Nächste Seite: Literatur

Dirk Ormoneit
Trevor Hastie
Volker Tresp

Data Mining

High-Dimensional Regression

Statistical learning suffers from the so-called ``curse-of-dimensionality'' when applied to high-dimensional data: The proportion of the training data that lie in a fixed-radius neighborhood of a point decreases to zero at an exponential rate with increasing dimension of the input space. While it is impossible to ``break'' the curse-of-dimensionality in general, we can frequently reduce the effective number of dimensions by finding a suitable projection of the original data into a simplified domain. In this work, we design a new algorithm that determines this projection automatically using ``kernel shaping''. Specifically, we consider local linear estimates and we estimate the optimal ``shape'' of the weighting kernel using training data. This idea is illustrated in Figures 1 and 2. The function on the left is sigmoidal along an index vector $\kappa$ and constant in directions orthogonal to $\kappa$ . Therefore, a ``shaped'' weighting kernel is shrunk in the direction $\kappa$ and stretched in the orthogonal direction (right), thus minimizing the exposure of the kernel to the nonlinear variation.

**Abbildung:** Example of a single index model of the form $y = g(x'\kappa)$ with $\kappa = (1,1)$ and $g(z) = \tanh(3z)$ .
$\begin{figure} \centerline {\psfig {figure=singidx.eps,width=.8\textwidth}}\par\end{figure}$

**Abbildung 2:** The contours of are straight lines orthogonal to $\kappa$ . The contours of an optimized weighting kernel is stretched in the orthogonal direction.
$\begin{figure} \centerline {\psfig {figure=singctur.eps,width=.8\textwidth}}\par\end{figure}$

We applied kernel shaping to a five- and a ten-dimensional toy example, and a third, real-world example. On the two artificial data sets, kernel shaping clearly outperformed local linear regression using a spherical kernel. Also, it led to a small performance improvement in the real-world example.

Gaussian Mixture Density Estimation

In another project, we consider Gaussian mixture density estimates of high-dimensional datasets. It is well-known that unregularized density estimation using Gaussian mixtures may well lead to overfitting: In the extreme case, the log-likelihood can go to infinity if the center of one Gaussian coincides with one of the data points and its variance approaches the null-matrix (see Figure , left). I compare three alternatives to deal with this situation:

Averaging
Averaging can improve the performance of a predictor if the individual models are sufficiently diverse. I compare several resampling schemes to increase diversity including bagging.
Maximum Penalized Likelihood
An alternative approach is to add a penalty term to the log-likelihood function as a regularizer. The maximum penalized likelihood approach is equivalent to the maximum a posterior (MAP) parameter estimate in a Bayesian approach if we interpret the penalty as the negative logarithm of the prior distribution. In particular, if we choose the negative logarithm of a conjugate prior as the penalty function, we can derive EM update rules to obtain the optimal parameter estimates. Regularized density estimates using several hyper-parameters are shown in Figure (middle and right).
Bayesian Sampling
In a ``fully'' Bayesian approach we compute the predictive distribution by integrating with respect to the posterior distribution. We use a Markov Chain Monte Carlo (MCMC) method for this purpose. In detail, parameter values can be sampled hierarchically using ``data augmentation'' in the Gaussian mixture case.

$\begin{figure} \centerline {\psfig {figure=cir_g0_01.eps,width=.6\textwidth}}\end{figure}$

$\begin{figure} \centerline {\psfig {figure=cir_g0_05.eps,width=.6\textwidth}}\end{figure}$

**Abbildung:** Unregularized density estimate and regularized density estimates based on a Gaussian mixture with 40 components.
$\begin{figure} \centerline {\psfig {figure=cir_g0_1.eps,width=.6\textwidth}}\end{figure}$

Experiments using artificially generated and real-world medical data give that averaging and maximum penalized likelihood always performed better than the maximum likelihood approach. The Bayesian approach gave good performance on a low-dimensional toy data set but failed on two higher-dimensional problems with ten and six dimensions, respectively. Averaging clearly outperformed maximum penalized likelihood on the medical data set.

Related Publications

D. Ormoneit and T. Hastie. Optimal kernel shapes for local linear regression. In S. A. Solla, T. K. Leen, and K-R. Müller, editors, Advances in Neural Information Processing Systems 12. The MIT Press, 2000. To appear.
D. Ormoneit and V. Tresp. Averaging, maximum penalized likelihood and Bayesian estimation for improving Gaussian mixture probability density estimates. IEEE Transactions on Neural Networks, 9(4):639-650, 1998.
D. Ormoneit. Probability Estimating Neural Networks, Shaker Verlag, Doctoral Thesis, Institut für Informatik, Technische Universität München, 1998. ISBN 3-8265-3723-8.
You can find a summary here.
R. Neuneier, F. Hergert, W. Finnoff, and D. Ormoneit. Estimation of Conditional Densities: A Comparison of Neural Network Approaches
Proceedings of the International Conference on Artificial Neural Networks, 1:689-692, 1994.

Nächste Seite: Literatur

Dirk Ormoneit
2000-05-07