Dirk Ormoneit

Trevor Hastie

Volker Tresp

**Data Mining**

**High-Dimensional Regression**

Statistical learning suffers from the so-called "curse of dimensionality" when applied to high-dimensional data: the proportion of the training data that lies in a fixed-radius neighborhood of a point decreases to zero at an exponential rate as the dimension of the input space grows. While it is impossible to "break" the curse of dimensionality in general, we can frequently reduce the effective number of dimensions by finding a suitable projection of the original data into a simplified domain. In this work, we design a new algorithm that determines this projection automatically using "kernel shaping". Specifically, we consider local linear estimates and estimate the optimal "shape" of the weighting kernel from training data. This idea is illustrated in Figures 1 and 2. The function on the left is sigmoidal along an index vector and constant in the directions orthogonal to it. Therefore, a "shaped" weighting kernel is shrunk along the index direction and stretched in the orthogonal directions (right), thus minimizing the exposure of the kernel to the nonlinear variation.
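The idea can be sketched in a few lines of NumPy: a local linear fit at a query point, weighted by a Gaussian kernel whose metric matrix determines the "shape". Here the metric is set by hand to be narrow along the index direction and wide orthogonally; the actual algorithm estimates it from the training data, so the function and parameter names below are illustrative only.

```python
import numpy as np

def local_linear_fit(X, y, x0, M):
    """Local linear estimate at query point x0 with a 'shaped' Gaussian
    kernel defined by the metric matrix M (illustrative sketch; the
    paper's algorithm estimates M rather than fixing it by hand)."""
    diff = X - x0                                # (n, d) inputs centered at x0
    # Mahalanobis-type kernel: narrow where M is large, wide where M is small.
    w = np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, M, diff))
    Z = np.hstack([np.ones((len(X), 1)), diff])  # local design matrix
    W = np.diag(w)
    # Weighted least squares via the normal equations.
    beta, *_ = np.linalg.lstsq(Z.T @ W @ Z, Z.T @ W @ y, rcond=None)
    return beta[0]                               # intercept = estimate of f(x0)

# Toy data in the spirit of the example: sigmoidal along an index vector b,
# constant in the orthogonal directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
b = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y = 1.0 / (1.0 + np.exp(-3.0 * X @ b)) + 0.01 * rng.normal(size=500)

x0 = np.zeros(5)
# Shaped metric: shrunk along b, stretched orthogonally to b.
M = 0.5 * np.eye(5) + 9.5 * np.outer(b, b)
print(local_linear_fit(X, y, x0, M))  # close to the true value f(x0) = 0.5
```

The metric `M` plays the role of the shaped kernel in the figure: its large eigenvalue along `b` shrinks the kernel where the function varies, and the small orthogonal eigenvalues stretch it where the function is constant.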

We applied kernel shaping to a five- and a ten-dimensional toy example, and to a third, real-world example. On the two artificial data sets, kernel shaping clearly outperformed local linear regression with a spherical kernel, and it also led to a small performance improvement on the real-world data.

**Gaussian Mixture Density Estimation**

In another project, we consider Gaussian mixture density estimates of high-dimensional data sets. It is well known that unregularized density estimation using Gaussian mixtures may well lead to overfitting: in the extreme case, the log-likelihood goes to infinity if the center of one Gaussian coincides with one of the data points and its covariance approaches the null matrix (see the figure, left). We compare three alternatives to deal with this situation:

**Averaging**

Averaging can improve the performance of a predictor if the individual models are sufficiently diverse. We compare several resampling schemes for increasing diversity, including *bagging*.

**Maximum Penalized Likelihood**

An alternative approach is to add a penalty term to the log-likelihood function as a regularizer. The maximum penalized likelihood approach is equivalent to the maximum a posteriori (MAP) parameter estimate in a Bayesian approach if we interpret the penalty as the negative logarithm of the prior distribution. In particular, if we choose the negative logarithm of a *conjugate* prior as the penalty function, we can derive EM update rules to obtain the optimal parameter estimates. Regularized density estimates obtained with several hyper-parameter settings are shown in the figure (middle and right).

**Bayesian Sampling**

In a "fully" Bayesian approach we compute the predictive distribution by integrating with respect to the posterior distribution. We use a Markov chain Monte Carlo (MCMC) method for this purpose. Specifically, parameter values can be sampled hierarchically using "data augmentation" in the Gaussian mixture case.
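A minimal sketch of data-augmentation sampling for a one-dimensional, two-component mixture is given below. It assumes a known component variance, a normal prior on the means, and a symmetric Dirichlet prior on the mixing weights; all function and parameter names are hypothetical simplifications of the hierarchical scheme described above.

```python
import numpy as np

def gibbs_mixture(x, K=2, iters=200, sigma=1.0, tau=10.0, alpha=1.0, seed=0):
    """Data-augmentation (Gibbs) sampling for a 1-D Gaussian mixture with
    known component variance sigma^2, N(0, tau^2) priors on the means, and
    a symmetric Dirichlet(alpha) prior on the mixing weights."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=K)          # initial component means
    pi = np.full(K, 1.0 / K)         # initial mixing weights
    samples = []
    for _ in range(iters):
        # Data augmentation: sample the latent component assignments z.
        logp = np.log(pi) - 0.5 * ((x[:, None] - mu) / sigma) ** 2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=row) for row in p])
        # Conditional on z, the conjugate posteriors have closed form.
        counts = np.bincount(z, minlength=K)
        for k in range(K):
            prec = counts[k] / sigma**2 + 1.0 / tau**2
            mean = (x[z == k].sum() / sigma**2) / prec
            mu[k] = rng.normal(mean, 1.0 / np.sqrt(prec))
        pi = rng.dirichlet(alpha + counts)
        samples.append(mu.copy())
    return np.array(samples)

# Two well-separated clusters; posterior means should sit near -3 and 3.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
draws = gibbs_mixture(x)
print(np.sort(draws[-50:], axis=1).mean(axis=0))  # roughly [-3, 3]
```

Averaging the retained draws (sorted per draw to sidestep label switching) approximates the posterior mean of the component centers; the same augmentation idea extends hierarchically to covariances and hyper-parameters.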

**Related Publications**

- D. Ormoneit and T. Hastie. Optimal kernel shapes for local linear regression. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, *Advances in Neural Information Processing Systems 12*. The MIT Press, 2000. To appear.
- D. Ormoneit and V. Tresp. Averaging, maximum penalized likelihood and Bayesian estimation for improving Gaussian mixture probability density estimates. *IEEE Transactions on Neural Networks*, 9(4):639-650, 1998.
- D. Ormoneit. *Probability Estimating Neural Networks*. Doctoral thesis, Institut für Informatik, Technische Universität München. Shaker Verlag, 1998. ISBN 3-8265-3723-8.
- R. Neuneier, F. Hergert, W. Finnoff, and D. Ormoneit. Estimation of conditional densities: A comparison of neural network approaches. *Proceedings of the International Conference on Artificial Neural Networks*, 1:689-692, 1994.