Irrelevant Features and the Subset Selection Problem

George John, Ron Kohavi, and Karl Pfleger
Computer Science Dept.
Stanford University
Stanford, CA 94305
{gjohn,ronnyk,kpfleger}@cs.stanford.edu

Abstract

We address the problem of finding a subset of features that allows a supervised induction algorithm to induce small, high-accuracy concepts. We examine notions of relevance and irrelevance, and show that the definitions used in the machine learning literature do not adequately partition the features into useful categories of relevance. We present definitions for irrelevance and for two degrees of relevance. These definitions improve our understanding of the behavior of previous subset selection algorithms, and help define the subset of features that should be sought. The features selected should depend not only on the features and the target concept, but also on the induction algorithm. We describe a method for feature subset selection using cross-validation that is applicable to any induction algorithm, and discuss experiments conducted with ID3 and C4.5 on artificial and real datasets.

Citation: John, George, Ron Kohavi and Karl Pfleger (1994). Irrelevant Features and the Subset Selection Problem. Machine Learning: Proceedings of the Eleventh International Conference, pp. 121--129, Morgan Kaufmann Publishers, San Francisco.
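
The abstract describes a wrapper-style feature subset selection procedure in which cross-validated accuracy of the induction algorithm itself guides the search over subsets. The sketch below is a minimal illustration of that idea only, not the paper's exact method: it assumes a greedy forward search, scikit-learn's DecisionTreeClassifier as a stand-in for ID3/C4.5, and the Iris data as an example dataset; none of these choices comes from the paper.

    # Illustrative wrapper feature selection via cross-validation
    # (greedy forward search; the estimator, search order, and data
    # are assumptions for demonstration, not the authors' setup).
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier  # stand-in for ID3/C4.5

    def forward_select(X, y, estimator, cv=5):
        """Greedily add the feature whose inclusion most improves
        cross-validated accuracy; stop when no candidate helps."""
        remaining = list(range(X.shape[1]))
        selected, best_score = [], -np.inf
        while remaining:
            scored = []
            for f in remaining:
                candidate = selected + [f]
                score = cross_val_score(estimator, X[:, candidate], y, cv=cv).mean()
                scored.append((score, f))
            score, f = max(scored)
            if score <= best_score:  # no candidate improves the estimate
                break
            best_score, selected = score, selected + [f]
            remaining.remove(f)
        return selected, best_score

    X, y = load_iris(return_X_y=True)
    subset, acc = forward_select(X, y, DecisionTreeClassifier(random_state=0))
    print("selected features:", subset, "estimated accuracy: %.3f" % acc)

Because the subset is scored by the same induction algorithm that will ultimately be trained, the selected features depend on that algorithm, which is the point made in the abstract.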