Static Versus Dynamic Sampling for Data Mining

George H. John and Pat Langley
Computer Science Department, Stanford University
Stanford, CA 94305
{gjohn,langley}@cs.stanford.edu
http://robotics.stanford.edu/~{gjohn,langley}/

As data warehouses grow to the point where one hundred gigabytes is considered small, the computational efficiency of data-mining algorithms on large databases becomes increasingly important. Using a sample from the database can speed up the data-mining process, but this is acceptable only if it does not reduce the quality of the mined knowledge. To this end, we introduce the "Probably Close Enough" criterion to describe the desired properties of a sample. Sampling usually refers to the use of *static* statistical tests to decide whether a sample is sufficiently similar to the large database, in the absence of any knowledge of the tools the data miner intends to use. We discuss *dynamic* sampling methods, which take into account the mining tool being used and can thus give better samples. We describe dynamic schemes that observe a mining tool's performance on training samples of increasing size and use these results to determine when a sample is sufficiently large. We evaluate these sampling methods on data from the UCI repository and conclude that dynamic sampling is preferable.

Citation: George H. John and Pat Langley. Static versus dynamic sampling for data mining. In E. Simoudis, J-W. Han, and U. Fayyad, editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 367-370, Menlo Park, CA, 1996. AAAI Press.
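
The dynamic scheme the abstract describes amounts to progressive sampling: train the mining tool on successively larger samples and stop once further data no longer improves performance appreciably, at which point the sample is "probably close enough." Below is a minimal sketch of that idea, assuming numpy arrays, a held-out validation set, and a scikit-learn decision tree standing in for the mining tool; the geometric growth schedule and the tolerance `eps` are illustrative choices, not the paper's exact criterion.

```python
# Illustrative sketch of dynamic (progressive) sampling, not the
# authors' exact procedure: grow the training sample geometrically
# and stop when validation accuracy (nearly) plateaus.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def dynamic_sample_size(X_train, y_train, X_val, y_val,
                        start=100, growth=2.0, eps=0.005, seed=0):
    """Return (n, accuracy) for a sample size judged 'probably close
    enough': enlarging the sample further improved validation accuracy
    by less than eps on the last step."""
    rng = np.random.default_rng(seed)
    n, prev_acc = start, -np.inf
    while True:
        n = min(n, len(X_train))
        idx = rng.choice(len(X_train), size=n, replace=False)
        model = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        acc = accuracy_score(y_val, model.predict(X_val))
        if acc - prev_acc < eps or n == len(X_train):
            return n, acc           # accuracy gain below tolerance
        prev_acc = acc
        n = int(n * growth)         # geometric sampling schedule
```

In contrast, a static test would pick the sample size from the data distribution alone; the sketch above instead couples the stopping decision to the observed learning curve of the specific mining tool, which is the distinction the paper draws.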