Enhancements to the Data Mining Process

George H. John
PhD Thesis, Computer Science Department, Stanford University
gjohn@cs.stanford.edu
www-cs-students.stanford.edu/~gjohn

Abstract

Data mining is the emerging science and industry of applying modern statistical and computational technologies to the problem of finding useful patterns hidden within large databases. This thesis describes the data mining process and presents advances and novel methods for each of its six steps: extracting data from a database or data warehouse, cleaning the data, data engineering, algorithm engineering, data mining, and analyzing the results.

We show how the standard data extraction process can be improved by building a direct interface between a data-mining algorithm and a relational database management system. Next, in data cleaning, we show how automatically iterating through the data mining process can identify records that can be profitably ignored during mining. For data engineering, we develop an automated way to iterate through the data mining process to choose the subset of attributes that yields the best estimated results. In algorithm engineering, a similar process is used to automatically set the parameters of a mining algorithm. For the data mining algorithms themselves, we study enhancements to classification tree induction methods and Bayesian methods. Our new flexible Bayes data-mining algorithm is fast, understandable, and more accurate than the standard Bayesian classifier in most situations. In classification tree induction we study various univariate splitting criteria and multivariate partitions. The analysis of results is necessarily domain-dependent: in an example applying data mining to stock selection, we discuss a key requirement of real-world applications, namely using appropriate domain-dependent methods to evaluate the proposed solution.
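A recurring theme in Chapters 3 through 5 is the wrapper idea: repeatedly run the mining algorithm on candidate variations of the data (records ignored, attribute subsets, parameter settings) and keep whichever variation gives the best estimated accuracy, typically via cross validation. The sketch below is a minimal illustration of that idea for attribute subset selection, not the thesis implementation: it assumes scikit-learn is available, uses DecisionTreeClassifier as a stand-in for C4.5, and uses a greedy forward search where the thesis also considers other search strategies.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_forward_selection(X, y, cv=5):
    """Greedy forward search over attribute subsets, scored by the learner itself.

    X is assumed to be a 2-D NumPy array (rows = records, columns = attributes).
    Returns the selected column indices and their cross-validated accuracy.
    """
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    improved = True
    while improved and remaining:
        improved = False
        for a in list(remaining):
            candidate = selected + [a]
            # Score this attribute subset by running the mining algorithm on it.
            score = cross_val_score(
                DecisionTreeClassifier(), X[:, candidate], y, cv=cv
            ).mean()
            if score > best_score:
                best_score, best_candidate = score, a
                improved = True
        if improved:
            selected.append(best_candidate)
            remaining.remove(best_candidate)
    return selected, best_score
```

Calling wrapper_forward_selection(X, y) returns the chosen attribute indices together with their estimated accuracy; the same score-and-keep loop could instead vary which records are kept or which parameter settings are used.

Similarly, the flexible Bayes classifier of Chapter 6 keeps the structure of the simple Bayesian classifier but estimates each continuous attribute's class-conditional density with a kernel density estimate over the training values rather than a single Gaussian. The sketch below is a minimal, continuous-attributes-only illustration; the Gaussian kernels and the per-class bandwidth of 1/sqrt(n_c) follow the published description of the method, but the class name and interface here are invented for this example.

```python
import math
from collections import defaultdict

class FlexibleBayes:
    """Kernel-density ("flexible") naive Bayes for continuous attributes only."""

    def fit(self, X, y):
        self.rows_by_class = defaultdict(list)      # class label -> training rows
        for row, label in zip(X, y):
            self.rows_by_class[label].append(row)
        n = len(y)
        self.priors = {c: len(rows) / n for c, rows in self.rows_by_class.items()}
        return self

    def _kernel_density(self, x, points, sigma):
        # Average of Gaussian kernels centred on each training value.
        return sum(
            math.exp(-0.5 * ((x - p) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
            for p in points
        ) / len(points)

    def predict(self, row):
        best_class, best_score = None, float("-inf")
        for c, rows in self.rows_by_class.items():
            sigma = 1.0 / math.sqrt(len(rows))      # per-class bandwidth assumption
            score = math.log(self.priors[c])
            for i, x in enumerate(row):
                density = self._kernel_density(x, [r[i] for r in rows], sigma)
                score += math.log(max(density, 1e-300))  # guard against log(0)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Toy usage: two classes, two continuous attributes.
clf = FlexibleBayes().fit([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0]], ["a", "a", "b"])
print(clf.predict([1.1, 2.0]))   # expected: "a"
```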
Table of Contents

Chapter 1  What is Data Mining?
  1.1 Definitions
  1.2 The Data Mining Process
    1.2.1 Understand and Define the Problem
    1.2.2 Data Collection and Extraction
    1.2.3 Data Cleaning and Exploration
    1.2.4 Data Engineering
    1.2.5 Algorithm Engineering
    1.2.6 Running the Data Mining Algorithm
    1.2.7 Preliminary Evaluation of Results
    1.2.8 Refining the Data and Problem
    1.2.9 Using the Results
  1.3 Types of Data Mining Patterns
    1.3.1 Association Rules
    1.3.2 Clustering
    1.3.3 Regression
    1.3.4 Classification
  1.4 Summary of Contributions

Chapter 2  Data Extraction: Interfacing with a Database
  2.1 Introduction
  2.2 Information Requirements and SIP
  2.3 SIP-Compliant Algorithms
    2.3.1 The C4.5 Classification Tree Algorithm
    2.3.2 Simple Bayes
  2.4 Related Work
  2.5 Summary

Chapter 3  Data Cleaning
  3.1 Introduction
  3.2 Robust Statistical Methods
  3.3 C4.5 and Pruning: Model Selection
  3.4 Esteemed C4.5: Data Selection
  3.5 Experimental Evaluation of Esteem
    3.5.1 Experimental Method
    3.5.2 Experimental Results
  3.6 Related Work
  3.7 Conclusion

Chapter 4  Data Engineering: Attribute Subset Selection
  4.1 Introduction
  4.2 The Attribute Subset Selection Problem
  4.3 Relevance
  4.4 Filter Methods
  4.5 The Wrapper Method
  4.6 Experimental Evaluation
  4.7 Related and Future Work
  4.8 Future Work
  4.9 Conclusion

Chapter 5  Algorithm Engineering: Parameter Tuning
  5.1 Introduction
  5.2 The Parameter Selection Problem
  5.3 The Wrapper Method for Parameter Selection
    5.3.1 Error Estimation by Cross Validation
    5.3.2 Optimization using Best-First Search
  5.4 Automatic Parameter Selection for C4.5
  5.5 Experiments with Tuning C4.5
    5.5.1 Experimental Method
    5.5.2 Experimental Results
  5.6 Related Work
  5.7 Conclusion

Chapter 6  Flexible Bayesian Classifiers
  6.1 Introduction
  6.2 The Simple Bayesian Classifier
  6.3 Flexible Simple Bayes
  6.4 Theoretical Properties of Flexible Bayes
  6.5 Experimental Studies
    6.5.1 Experiments on Natural Data
    6.5.2 Artificial Domains
  6.6 Discussion
  6.7 Conclusion

Chapter 7  Trees: Greedy Attribute Selection
  7.1 Introduction
  7.2 Definitions of Splitting Criteria
    7.2.1 Existing Attribute Selection Criteria
    7.2.2 New Attribute Selection Criteria
    7.2.3 The Δ Criterion
    7.2.4 The Pessimistic Error Estimate Criterion
    7.2.5 Cross Validation-Selected Splitting Criteria
  7.3 Experiments
    7.3.1 Methodology
    7.3.2 Experimental Results
  7.4 Related Work
  7.5 Conclusions

Chapter 8  Trees: Multivariate Partitioning
  8.1 Introduction
  8.2 Finding Splitting Functions
  8.3 Building Trees with Soft Entropy
    8.3.1 Why *Soft* Entropy?
    8.3.2 Why Soft *Entropy*?
    8.3.3 Finding the Best Discriminant
  8.4 Pruning a Tree
  8.5 Experiments
    8.5.1 Methodology
    8.5.2 Results and Discussion
  8.6 Related Work
  8.7 Future Work
  8.8 Conclusion

Chapter 9  Analyzing Results: Stock Selection
  9.1 Introduction
  9.2 Applying the Data Mining Process
  9.3 Analyzing Results
  9.4 Long/Short Portfolios
  9.5 Concluding Remarks

Chapter 10  The Future of Data Mining
  10.1 Data Mining for the Rest of Us
  10.2 Fertile Technological Developments
  10.3 Conclusion

Bibliography

Citation: George H. John. _Enhancements to the Data Mining Process_, PhD Thesis, Computer Science Department, School of Engineering, Stanford University, 194 pp., March 1997. Order from UMI, (800) 521-0600, Dissertation #9723376.