Statistical learning suffers from the so-called ``curse-of-dimensionality''
when applied to high-dimensional data:
The proportion of the training
data that lie in a fixed-radius neighborhood of a point decreases to zero at an exponential rate
with increasing dimension of the input space.
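For illustration, suppose the inputs are distributed uniformly on the unit cube $[0,1]^d$ and the query point lies in the interior; then, ignoring boundary effects,
\[
\Pr\bigl(\|X - x\|_\infty \le r\bigr) \;=\; (2r)^d, \qquad r < \tfrac{1}{2},
\]
which decays exponentially in the dimension $d$: a neighborhood that captures a fraction $0.2$ of the data in one dimension ($r = 0.1$) captures only about $0.2^{10} \approx 10^{-7}$ of it in ten dimensions.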
While it is impossible to ``break'' the curse-of-dimensionality in general,
we can frequently reduce the effective number of
dimensions by finding a suitable projection of the original data into
a simplified domain.
In this work, we design a new algorithm that determines this
projection automatically using ``kernel shaping''.
Specifically, we consider local linear estimates
and determine the optimal ``shape'' of the
weighting kernel from the training data.
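In its standard form (with our notation), the local linear estimate at a query point $x$ solves a weighted least-squares problem,
\[
\hat f(x) = \hat\alpha, \qquad
(\hat\alpha, \hat\beta) = \arg\min_{\alpha, \beta}
\sum_{i=1}^{n} w_i(x)\,\bigl(y_i - \alpha - \beta^{\top}(x_i - x)\bigr)^2,
\qquad
w_i(x) = \exp\!\Bigl(-\tfrac{1}{2}(x_i - x)^{\top} M (x_i - x)\Bigr),
\]
where the symmetric positive-definite matrix $M$ encodes the ``shape'' of the Gaussian weighting kernel. A spherical kernel corresponds to $M = h^{-2} I$ with a single bandwidth $h$; kernel shaping instead estimates a full matrix $M$ from the training data.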
This idea is illustrated in
Figures 1 and 2.
The function on the left is sigmoidal along an index vector and constant
in directions orthogonal to it. Therefore, the
weighting kernel is shrunk
in the index direction and stretched in the orthogonal directions (right),
thus minimizing the exposure of the kernel to the nonlinear variation.
Figures 1 and 2: an example of a single index model, whose contours are straight lines orthogonal to the index vector (left), and the contours of an optimized weighting kernel, which are stretched in the orthogonal direction (right).
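A minimal sketch of this estimator in Python (the shape matrix $M$ is assumed given here; in practice it would be estimated from the training data, for instance by minimizing a cross-validated prediction error):
\begin{verbatim}
import numpy as np

def shaped_weights(X, x0, M):
    # Gaussian weighting kernel with a general metric M (the kernel "shape");
    # M = I / h**2 recovers the usual spherical kernel with bandwidth h.
    diff = X - x0
    return np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, M, diff))

def local_linear_fit(X, y, x0, M):
    # Weighted least-squares fit of an intercept plus a linear term around x0;
    # the intercept is the local linear estimate of the function at x0.
    n, d = X.shape
    w = shaped_weights(X, x0, M)
    Z = np.hstack([np.ones((n, 1)), X - x0])
    A = Z.T @ (w[:, None] * Z) + 1e-8 * np.eye(d + 1)  # small ridge for stability
    b = Z.T @ (w * y)
    return np.linalg.solve(A, b)[0]
\end{verbatim}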
We applied kernel shaping to a five- and a ten-dimensional toy example, and a third, real-world example.
On the two artificial data sets,
kernel shaping clearly outperformed local linear regression using a spherical kernel.
It led to a
small performance improvement on the real-world example.
Gaussian Mixture Density Estimation
In another project, we consider density estimation using mixtures of Gaussians.
It is well-known that unregularized density estimation using
Gaussian mixtures may well lead to overfitting: in the extreme case, the log-likelihood can go to infinity if the center of one Gaussian coincides with one of the data points and its covariance matrix approaches the null matrix (see the left panel of the figure).
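To see how this degeneracy arises, suppose one component has its mean placed exactly on a data point, $\mu_j = x_1$, and isotropic covariance $\sigma^2 I$. Its contribution to the mixture density at $x_1$ is
\[
\pi_j\,\mathcal N\bigl(x_1 \mid \mu_j, \sigma^2 I\bigr)
= \frac{\pi_j}{(2\pi\sigma^2)^{d/2}}
\;\longrightarrow\; \infty
\qquad \text{as } \sigma \to 0,
\]
while the log-likelihood terms of the remaining data points stay bounded below as long as the other components are non-degenerate, so the total log-likelihood is unbounded above.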
We compare three alternatives to deal with this situation.

Averaging

Averaging can improve the performance of a predictor if the individual models are sufficiently diverse. We evaluate several resampling schemes for increasing this diversity, including bagging.
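As a minimal sketch, bagging of mixture densities can be implemented by fitting one mixture per bootstrap resample and averaging the resulting densities (here using scikit-learn's GaussianMixture for the individual fits; the numbers of models and components are placeholders):
\begin{verbatim}
import numpy as np
from sklearn.mixture import GaussianMixture

def bagged_gmm_log_density(X, X_eval, n_models=10, n_components=40, seed=0):
    # Fit one Gaussian mixture per bootstrap resample of X and average
    # the estimated densities at the evaluation points X_eval.
    rng = np.random.default_rng(seed)
    densities = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample
        gm = GaussianMixture(n_components=n_components,
                             reg_covar=1e-6).fit(X[idx])
        densities.append(np.exp(gm.score_samples(X_eval)))
    # Average in density space, then return the log-density for evaluation.
    return np.log(np.mean(densities, axis=0))
\end{verbatim}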
Maximum Penalized Likelihood
An alternative approach is to add a penalty term to the log-likelihood function as a
regularizer. The maximum penalized likelihood approach is equivalent
to the maximum a posteriori (MAP) parameter estimate in a Bayesian approach
if we interpret the penalty as the negative logarithm of the prior distribution.
In particular, if we choose the negative logarithm of a conjugate prior
as the penalty function,
we can derive EM update rules to obtain
the optimal parameter estimates.
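In outline, writing $\theta$ for the full set of mixture parameters, the penalized estimate is
\[
\hat\theta_{\mathrm{MPL}}
= \arg\max_{\theta}\,\Bigl[\,\sum_{i=1}^{n}\log\sum_{j=1}^{m}\pi_j\,\mathcal N(x_i \mid \mu_j, \Sigma_j) + \log p(\theta)\Bigr]
= \hat\theta_{\mathrm{MAP}} .
\]
The E-step is unchanged, and for a conjugate prior (a Dirichlet distribution on the mixing weights and a normal-inverse-Wishart distribution on the means and covariances) the penalized M-step retains a closed form; for example, the prior scale matrix is added to the weighted scatter matrix in the covariance update, which keeps the covariance estimates away from the null matrix.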
Regularized density estimates using several hyper-parameter settings are shown in the middle and right panels of the figure.
Bayesian Estimation

In a ``fully'' Bayesian approach, we compute the predictive distribution by integrating with respect to the posterior distribution of the parameters.
We use a Markov Chain Monte Carlo (MCMC) method for this purpose: in the Gaussian mixture case, parameter values can be sampled hierarchically using ``data augmentation''.
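Concretely, the predictive density at a new point $x$ is approximated by averaging over posterior samples,
\[
p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
\;\approx\; \frac{1}{S}\sum_{s=1}^{S} p\bigl(x \mid \theta^{(s)}\bigr),
\qquad \theta^{(s)} \sim p(\theta \mid D),
\]
where the samples are generated by alternately drawing the latent component assignments given the parameters and the parameters given the assignments.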
Figure: maximum likelihood estimate (left) and regularized density estimates (middle and right) based on a Gaussian mixture with 40 components.
Experiments using artificially generated and real-world medical data show that
maximum penalized likelihood always performed better than the
maximum likelihood approach. The Bayesian approach gave good
performance on a low-dimensional toy data set but failed on two
higher-dimensional problems with ten and six dimensions, respectively.
Averaging clearly outperformed maximum penalized likelihood on the medical data set.
D. Ormoneit and V. Tresp.
Averaging, maximum penalized likelihood and Bayesian estimation for
improving Gaussian mixture probability density estimates. IEEE Transactions on Neural Networks, 9(4):639-650, 1998.