Statistical learning suffers from the so-called ``curse-of-dimensionality''
when applied to high-dimensional data:
The proportion of the training
data that lie in a fixed-radius neighborhood of a point decreases to zero at an exponential rate
with increasing dimension of the input space.
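For illustration, suppose the inputs are distributed uniformly on the unit cube $[0,1]^d$ and the query point lies in the interior; then, ignoring boundary effects,
\[
\Pr\bigl(\|X - x\|_\infty \le r\bigr) \;=\; (2r)^d, \qquad r < \tfrac{1}{2},
\]
which decays exponentially in the dimension $d$: a neighborhood that captures a fraction $0.2$ of the data in one dimension ($r = 0.1$) captures only about $0.2^{10} \approx 10^{-7}$ of it in ten dimensions.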
While it is impossible to ``break'' the curse-of-dimensionality in general,
we can frequently reduce the effective number of
dimensions by finding a suitable projection of the original data into
a simplified domain.
In this work, we design a new algorithm that determines this
projection automatically using ``kernel shaping''.
Specifically, we consider local linear estimates
and determine the optimal ``shape'' of the
weighting kernel from the training data.
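In its standard form (with our notation), the local linear estimate at a query point $x$ solves a weighted least-squares problem,
\[
\hat f(x) = \hat\alpha, \qquad
(\hat\alpha, \hat\beta) = \arg\min_{\alpha, \beta}
\sum_{i=1}^{n} w_i(x)\,\bigl(y_i - \alpha - \beta^{\top}(x_i - x)\bigr)^2,
\qquad
w_i(x) = \exp\!\Bigl(-\tfrac{1}{2}(x_i - x)^{\top} M (x_i - x)\Bigr),
\]
where the symmetric positive-definite matrix $M$ encodes the ``shape'' of the Gaussian weighting kernel. A spherical kernel corresponds to $M = h^{-2} I$ with a single bandwidth $h$; kernel shaping instead estimates a full matrix $M$ from the training data.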
This idea is illustrated in
Figures 1 and 2.
The function on the left is sigmoidal along an index vector and constant
in directions orthogonal to it. Therefore, the
weighting kernel is shrunk
in the index direction and stretched in the orthogonal directions (right),
thus minimizing the exposure of the kernel to the nonlinear variation.
Figures 1 and 2: an example of a single index model, whose contours are straight lines orthogonal to the index vector (left), and the contours of an optimized weighting kernel, which are stretched in the orthogonal direction (right).
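A minimal sketch of this estimator in Python (the shape matrix $M$ is assumed given here; in practice it would be estimated from the training data, for instance by minimizing a cross-validated prediction error):
\begin{verbatim}
import numpy as np

def shaped_weights(X, x0, M):
    # Gaussian weighting kernel with a general metric M (the kernel "shape");
    # M = I / h**2 recovers the usual spherical kernel with bandwidth h.
    diff = X - x0
    return np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, M, diff))

def local_linear_fit(X, y, x0, M):
    # Weighted least-squares fit of an intercept plus a linear term around x0;
    # the intercept is the local linear estimate of the function at x0.
    n, d = X.shape
    w = shaped_weights(X, x0, M)
    Z = np.hstack([np.ones((n, 1)), X - x0])
    A = Z.T @ (w[:, None] * Z) + 1e-8 * np.eye(d + 1)  # small ridge for stability
    b = Z.T @ (w * y)
    return np.linalg.solve(A, b)[0]
\end{verbatim}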
We applied kernel shaping to a five- and a ten-dimensional toy example, and a third, real-world example.
On the two artificial data sets,
kernel shaping clearly outperformed local linear regression using a spherical kernel.
It led to a
small performance improvement on the real-world example.
Gaussian Mixture Density Estimation
In another project, we consider density estimation using mixtures of Gaussians.
It is well-known that unregularized density estimation using
Gaussian mixtures may well lead to overfitting: in the extreme case, the log-likelihood can go to infinity if the center of one Gaussian coincides with one of the data points and its covariance matrix approaches the null matrix (see the left panel of the figure).
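To see how this degeneracy arises, suppose one component has its mean placed exactly on a data point, $\mu_j = x_1$, and isotropic covariance $\sigma^2 I$. Its contribution to the mixture density at $x_1$ is
\[
\pi_j\,\mathcal N\bigl(x_1 \mid \mu_j, \sigma^2 I\bigr)
= \frac{\pi_j}{(2\pi\sigma^2)^{d/2}}
\;\longrightarrow\; \infty
\qquad \text{as } \sigma \to 0,
\]
while the log-likelihood terms of the remaining data points stay bounded below as long as the other components are non-degenerate, so the total log-likelihood is unbounded above.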
We compare three alternatives to deal with this situation.

Averaging

Averaging can improve the performance of a predictor if the individual models are sufficiently diverse. We evaluate several resampling schemes for increasing this diversity, including bagging.
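As a minimal sketch, bagging of mixture densities can be implemented by fitting one mixture per bootstrap resample and averaging the resulting densities (here using scikit-learn's GaussianMixture for the individual fits; the numbers of models and components are placeholders):
\begin{verbatim}
import numpy as np
from sklearn.mixture import GaussianMixture

def bagged_gmm_log_density(X, X_eval, n_models=10, n_components=40, seed=0):
    # Fit one Gaussian mixture per bootstrap resample of X and average
    # the estimated densities at the evaluation points X_eval.
    rng = np.random.default_rng(seed)
    densities = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample
        gm = GaussianMixture(n_components=n_components,
                             reg_covar=1e-6).fit(X[idx])
        densities.append(np.exp(gm.score_samples(X_eval)))
    # Average in density space, then return the log-density for evaluation.
    return np.log(np.mean(densities, axis=0))
\end{verbatim}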
Maximum Penalized Likelihood
An alternative approach is to add a penalty term to the log-likelihood function as a
regularizer. The maximum penalized likelihood approach is equivalent
to the maximum a posteriori (MAP) parameter estimate in a Bayesian approach
if we interpret the penalty as the negative logarithm of the prior distribution.
In particular, if we choose the negative logarithm of a conjugate prior
as the penalty function,
we can derive EM update rules to obtain
the optimal parameter estimates.
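In outline, writing $\theta$ for the full set of mixture parameters, the penalized estimate is
\[
\hat\theta_{\mathrm{MPL}}
= \arg\max_{\theta}\,\Bigl[\,\sum_{i=1}^{n}\log\sum_{j=1}^{m}\pi_j\,\mathcal N(x_i \mid \mu_j, \Sigma_j) + \log p(\theta)\Bigr]
= \hat\theta_{\mathrm{MAP}} .
\]
The E-step is unchanged, and for a conjugate prior (a Dirichlet distribution on the mixing weights and a normal-inverse-Wishart distribution on the means and covariances) the penalized M-step retains a closed form; for example, the prior scale matrix is added to the weighted scatter matrix in the covariance update, which keeps the covariance estimates away from the null matrix.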
Regularized density estimates using several hyper-parameter settings are shown in the middle and right panels of the figure.
Bayesian Estimation

In a ``fully'' Bayesian approach, we compute the predictive distribution by integrating with respect to the posterior distribution of the parameters.
We use a Markov Chain Monte Carlo (MCMC) method for this purpose: in the Gaussian mixture case, parameter values can be sampled hierarchically using ``data augmentation''.
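Concretely, the predictive density at a new point $x$ is approximated by averaging over posterior samples,
\[
p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
\;\approx\; \frac{1}{S}\sum_{s=1}^{S} p\bigl(x \mid \theta^{(s)}\bigr),
\qquad \theta^{(s)} \sim p(\theta \mid D),
\]
where the samples are generated by alternately drawing the latent component assignments given the parameters and the parameters given the assignments.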
Figure: maximum likelihood estimate (left) and regularized density estimates (middle and right) based on a Gaussian mixture with 40 components.
Experiments using artificially generated and real-world medical data show that
maximum penalized likelihood always performed better than the
maximum likelihood approach. The Bayesian approach gave good
performance on a low-dimensional toy data set but failed on two
higher-dimensional problems with ten and six dimensions, respectively.
Averaging clearly outperformed maximum penalized likelihood on the medical data set.
D. Ormoneit and V. Tresp.
Averaging, maximum penalized likelihood and Bayesian estimation for
improving Gaussian mixture probability density estimates. IEEE Transactions on Neural Networks, 9(4):639-650, 1998.