Enhancements to the Data Mining Process

George H. John
PhD Thesis, Computer Science Department, Stanford University
gjohn@cs.stanford.edu
www-cs-students.stanford.edu/~gjohn

Abstract

Data mining is the emerging science and industry of applying modern statistical and computational technologies to the problem of finding useful patterns hidden within large databases. This thesis describes the data mining process and presents advances and novel methods for each of its six steps: extracting data from a database or data warehouse, cleaning the data, data engineering, algorithm engineering, data mining, and analyzing the results.

We show how the standard data extraction process can be improved by building a direct interface between a data-mining algorithm and a relational database management system. Next, in data cleaning, we show how automatically iterating through the data mining process can identify records that can be profitably ignored during mining. For data engineering, we develop an automated way to iterate through the data mining process to choose the subset of attributes that yields the best estimated results. In algorithm engineering, a similar process is used to automatically set the parameters of a mining algorithm. For the data mining algorithms themselves, we study enhancements to classification tree induction methods and Bayesian methods. Our new flexible Bayes data-mining algorithm is fast, understandable, and more accurate than the standard Bayesian classifier in most situations. In classification tree induction we study various univariate splitting criteria and multivariate partitions. The analysis of results is necessarily domain-dependent: in an example applying data mining to stock selection, we discuss a key requirement of real-world applications, namely using appropriate domain-dependent methods to evaluate the proposed solution.
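A recurring theme in Chapters 3 through 5 is the wrapper idea: repeatedly run the mining algorithm on candidate variations of the data (records ignored, attribute subsets, parameter settings) and keep whichever variation gives the best estimated accuracy, typically via cross validation. The sketch below is a minimal illustration of that idea for attribute subset selection, not the thesis implementation: it assumes scikit-learn is available, uses DecisionTreeClassifier as a stand-in for C4.5, and uses a greedy forward search where the thesis also considers other search strategies.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_forward_selection(X, y, cv=5):
    """Greedy forward search over attribute subsets, scored by the learner itself.

    X is assumed to be a 2-D NumPy array (rows = records, columns = attributes).
    Returns the selected column indices and their cross-validated accuracy.
    """
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    improved = True
    while improved and remaining:
        improved = False
        for a in list(remaining):
            candidate = selected + [a]
            # Score this attribute subset by running the mining algorithm on it.
            score = cross_val_score(
                DecisionTreeClassifier(), X[:, candidate], y, cv=cv
            ).mean()
            if score > best_score:
                best_score, best_candidate = score, a
                improved = True
        if improved:
            selected.append(best_candidate)
            remaining.remove(best_candidate)
    return selected, best_score
```

Calling wrapper_forward_selection(X, y) returns the chosen attribute indices together with their estimated accuracy; the same score-and-keep loop could instead vary which records are kept or which parameter settings are used.

Similarly, the flexible Bayes classifier of Chapter 6 keeps the structure of the simple Bayesian classifier but estimates each continuous attribute's class-conditional density with a kernel density estimate over the training values rather than a single Gaussian. The sketch below is a minimal, continuous-attributes-only illustration; the Gaussian kernels and the per-class bandwidth of 1/sqrt(n_c) follow the published description of the method, but the class name and interface here are invented for this example.

```python
import math
from collections import defaultdict

class FlexibleBayes:
    """Kernel-density ("flexible") naive Bayes for continuous attributes only."""

    def fit(self, X, y):
        self.rows_by_class = defaultdict(list)      # class label -> training rows
        for row, label in zip(X, y):
            self.rows_by_class[label].append(row)
        n = len(y)
        self.priors = {c: len(rows) / n for c, rows in self.rows_by_class.items()}
        return self

    def _kernel_density(self, x, points, sigma):
        # Average of Gaussian kernels centred on each training value.
        return sum(
            math.exp(-0.5 * ((x - p) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
            for p in points
        ) / len(points)

    def predict(self, row):
        best_class, best_score = None, float("-inf")
        for c, rows in self.rows_by_class.items():
            sigma = 1.0 / math.sqrt(len(rows))      # per-class bandwidth assumption
            score = math.log(self.priors[c])
            for i, x in enumerate(row):
                density = self._kernel_density(x, [r[i] for r in rows], sigma)
                score += math.log(max(density, 1e-300))  # guard against log(0)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Toy usage: two classes, two continuous attributes.
clf = FlexibleBayes().fit([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0]], ["a", "a", "b"])
print(clf.predict([1.1, 2.0]))   # expected: "a"
```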
Table of Contents

Chapter 1  What is Data Mining?
  1.1 Definitions
  1.2 The Data Mining Process
    1.2.1 Understand and Define the Problem
    1.2.2 Data Collection and Extraction
    1.2.3 Data Cleaning and Exploration
    1.2.4 Data Engineering
    1.2.5 Algorithm Engineering
    1.2.6 Running the Data Mining Algorithm
    1.2.7 Preliminary Evaluation of Results
    1.2.8 Refining the Data and Problem
    1.2.9 Using the Results
  1.3 Types of Data Mining Patterns
    1.3.1 Association Rules
    1.3.2 Clustering
    1.3.3 Regression
    1.3.4 Classification
  1.4 Summary of Contributions

Chapter 2  Data Extraction: Interfacing with a Database
  2.1 Introduction
  2.2 Information Requirements and SIP
  2.3 SIP-Compliant Algorithms
    2.3.1 The C4.5 Classification Tree Algorithm
    2.3.2 Simple Bayes
  2.4 Related Work
  2.5 Summary

Chapter 3  Data Cleaning
  3.1 Introduction
  3.2 Robust Statistical Methods
  3.3 C4.5 and Pruning: Model Selection
  3.4 Esteemed C4.5: Data Selection
  3.5 Experimental Evaluation of Esteem
    3.5.1 Experimental Method
    3.5.2 Experimental Results
  3.6 Related Work
  3.7 Conclusion

Chapter 4  Data Engineering: Attribute Subset Selection
  4.1 Introduction
  4.2 The Attribute Subset Selection Problem
  4.3 Relevance
  4.4 Filter Methods
  4.5 The Wrapper Method
  4.6 Experimental Evaluation
  4.7 Related and Future Work
  4.8 Future Work
  4.9 Conclusion

Chapter 5  Algorithm Engineering: Parameter Tuning
  5.1 Introduction
  5.2 The Parameter Selection Problem
  5.3 The Wrapper Method for Parameter Selection
    5.3.1 Error Estimation by Cross Validation
    5.3.2 Optimization using Best-First Search
  5.4 Automatic Parameter Selection for C4.5
  5.5 Experiments with Tuning C4.5
    5.5.1 Experimental Method
    5.5.2 Experimental Results
  5.6 Related Work
  5.7 Conclusion

Chapter 6  Flexible Bayesian Classifiers
  6.1 Introduction
  6.2 The Simple Bayesian Classifier
  6.3 Flexible Simple Bayes
  6.4 Theoretical Properties of Flexible Bayes
  6.5 Experimental Studies
    6.5.1 Experiments on Natural Data
    6.5.2 Artificial Domains
  6.6 Discussion
  6.7 Conclusion

Chapter 7  Trees: Greedy Attribute Selection
  7.1 Introduction
  7.2 Definitions of Splitting Criteria
    7.2.1 Existing Attribute Selection Criteria
    7.2.2 New Attribute Selection Criteria
    7.2.3 The Δ Criterion
    7.2.4 The Pessimistic Error Estimate Criterion
    7.2.5 Cross Validation-Selected Splitting Criteria
  7.3 Experiments
    7.3.1 Methodology
    7.3.2 Experimental Results
  7.4 Related Work
  7.5 Conclusions

Chapter 8  Trees: Multivariate Partitioning
  8.1 Introduction
  8.2 Finding Splitting Functions
  8.3 Building Trees with Soft Entropy
    8.3.1 Why *Soft* Entropy?
    8.3.2 Why Soft *Entropy*?
    8.3.3 Finding the Best Discriminant
  8.4 Pruning a Tree
  8.5 Experiments
    8.5.1 Methodology
    8.5.2 Results and Discussion
  8.6 Related Work
  8.7 Future Work
  8.8 Conclusion

Chapter 9  Analyzing Results: Stock Selection
  9.1 Introduction
  9.2 Applying the Data Mining Process
  9.3 Analyzing Results
  9.4 Long/Short Portfolios
  9.5 Concluding Remarks

Chapter 10  The Future of Data Mining
  10.1 Data Mining for the Rest of Us
  10.2 Fertile Technological Developments
  10.3 Conclusion

Bibliography

Citation: George H. John. _Enhancements to the Data Mining Process_, PhD Thesis, Computer Science Department, School of Engineering, Stanford University, 194 pp., March 1997. Order from UMI, (800) 521-0600, Dissertation #9723376.