Irrelevant Features and the Subset Selection Problem

George John, Ron Kohavi, and Karl Pfleger
Computer Science Dept.
Stanford University
Stanford, CA 94305
{gjohn,ronnyk,kpfleger}@cs.stanford.edu

Abstract

We address the problem of finding a subset of features that allows a supervised induction algorithm to induce small, high-accuracy concepts. We examine notions of relevance and irrelevance, and show that the definitions used in the machine learning literature do not adequately partition the features into useful categories of relevance. We present definitions for irrelevance and for two degrees of relevance. These definitions improve our understanding of the behavior of previous subset selection algorithms, and help define the subset of features that should be sought. The features selected should depend not only on the features and the target concept, but also on the induction algorithm. We describe a method for feature subset selection using cross-validation that is applicable to any induction algorithm, and discuss experiments conducted with ID3 and C4.5 on artificial and real datasets.

Citation: John, George, Ron Kohavi and Karl Pfleger (1994). Irrelevant Features and the Subset Selection Problem. Machine Learning: Proceedings of the Eleventh International Conference, pp. 121--129, Morgan Kaufmann Publishers, San Francisco.
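
The abstract describes a wrapper-style feature subset selection procedure in which cross-validated accuracy of the induction algorithm itself guides the search over subsets. The sketch below is a minimal illustration of that idea only, not the paper's exact method: it assumes a greedy forward search, scikit-learn's DecisionTreeClassifier as a stand-in for ID3/C4.5, and the Iris data as an example dataset; none of these choices comes from the paper.

    # Illustrative wrapper feature selection via cross-validation
    # (greedy forward search; the estimator, search order, and data
    # are assumptions for demonstration, not the authors' setup).
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier  # stand-in for ID3/C4.5

    def forward_select(X, y, estimator, cv=5):
        """Greedily add the feature whose inclusion most improves
        cross-validated accuracy; stop when no candidate helps."""
        remaining = list(range(X.shape[1]))
        selected, best_score = [], -np.inf
        while remaining:
            scored = []
            for f in remaining:
                candidate = selected + [f]
                score = cross_val_score(estimator, X[:, candidate], y, cv=cv).mean()
                scored.append((score, f))
            score, f = max(scored)
            if score <= best_score:  # no candidate improves the estimate
                break
            best_score, selected = score, selected + [f]
            remaining.remove(f)
        return selected, best_score

    X, y = load_iris(return_X_y=True)
    subset, acc = forward_select(X, y, DecisionTreeClassifier(random_state=0))
    print("selected features:", subset, "estimated accuracy: %.3f" % acc)

Because the subset is scored by the same induction algorithm that will ultimately be trained, the selected features depend on that algorithm, which is the point made in the abstract.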