Robust Decision Trees: Removing Outliers from Databases

                             George John
                        Computer Science Dept.
                         Stanford University
                          Stanford, CA 94305
                        gjohn@cs.stanford.edu

  Finding and removing outliers is an important problem in data
  mining.  Errors in large databases can be extremely common, so an
  important property of a data mining algorithm is robustness
  with respect to errors in the database.  Most sophisticated
  methods in machine learning address this problem to some extent,
  but not fully, and can be improved by addressing the problem more
  directly.  In this paper we examine C4.5, a decision tree algorithm
  that is already quite robust -- few algorithms have been shown to
  consistently achieve higher accuracy.  C4.5 incorporates a pruning
  scheme that partially addresses the outlier removal problem.  In
  our Robust-C4.5 algorithm we extend the pruning method to fully
  remove the effect of outliers, which improves results on many
  databases.
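
  The abstract does not spell out how the pruning method is extended.
  The sketch below assumes one plausible reading: train a model, drop
  the training records it misclassifies, and retrain until no records
  are dropped.  The names fit_stump and robust_fit are hypothetical,
  and the one-split stump is only a stand-in for C4.5 with pruning; it
  is not the paper's implementation.

```python
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(xs, ys):
    """Hypothetical stand-in for a pruned tree: the single threshold on one
    numeric feature that minimizes training errors (or no split at all)."""
    default = majority(ys)
    best_err = sum(y != default for y in ys)
    best_rule = (None, default, default)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        l, r = majority(left), majority(right)
        err = sum(y != l for y in left) + sum(y != r for y in right)
        if err < best_err:
            best_err, best_rule = err, (t, l, r)
    t, l, r = best_rule
    return lambda x: l if (t is None or x <= t) else r


def robust_fit(xs, ys, fit=fit_stump):
    """Assumed outlier-removal loop: refit after discarding every training
    record the current model misclassifies, until none are discarded."""
    xs, ys = list(xs), list(ys)
    while True:
        predict = fit(xs, ys)
        keep = [i for i in range(len(xs)) if predict(xs[i]) == ys[i]]
        if len(keep) == len(xs):
            return predict, xs, ys
        xs = [xs[i] for i in keep]
        ys = [ys[i] for i in keep]


# Toy data: the record x=2 labelled 'b' sits inside the 'a' region.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = ['a', 'b', 'a', 'a', 'b', 'b', 'b', 'b']
predict, kept_xs, kept_ys = robust_fit(xs, ys)
```

  On this toy data the loop discards the single inconsistent record and
  stops after the second fit.  The loop always terminates because each
  pass either removes at least one record or returns.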

Citation: 
George H. John, Robust Decision Trees: Removing Outliers from Databases.
In U. M. Fayyad and R. Uthurusamy, editors, _Proceedings of the First
International Conference on Knowledge Discovery and Data Mining_,
pages 174-179, AAAI Press, Menlo Park, CA, 1995.