Static Versus Dynamic Sampling for Data Mining

George H. John and Pat Langley
Computer Science Department, Stanford University
Stanford, CA 94305
{gjohn,langley}@cs.stanford.edu
http://robotics.stanford.edu/~{gjohn,langley}/

As data warehouses grow to the point where one hundred gigabytes is considered small, the computational efficiency of data-mining algorithms on large databases becomes increasingly important. Using a sample from the database can speed up the data-mining process, but this is acceptable only if it does not reduce the quality of the mined knowledge. To this end, we introduce the "Probably Close Enough" criterion to describe the desired properties of a sample. Sampling usually refers to the use of *static* statistical tests to decide whether a sample is sufficiently similar to the large database, in the absence of any knowledge of the tools the data miner intends to use. We discuss *dynamic* sampling methods, which take into account the mining tool being used and can thus give better samples. We describe dynamic schemes that observe a mining tool's performance on training samples of increasing size and use these results to determine when a sample is sufficiently large. We evaluate these sampling methods on data from the UCI repository and conclude that dynamic sampling is preferable.

Citation: George H. John and Pat Langley. Static versus dynamic sampling for data mining. In E. Simoudis, J-W. Han, and U. Fayyad, editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 367-370, Menlo Park, CA, 1996. AAAI Press.
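
The dynamic scheme the abstract describes amounts to progressive sampling: train the mining tool on successively larger samples and stop once further data no longer improves performance appreciably, at which point the sample is "probably close enough." Below is a minimal sketch of that idea, assuming numpy arrays, a held-out validation set, and a scikit-learn decision tree standing in for the mining tool; the geometric growth schedule and the tolerance `eps` are illustrative choices, not the paper's exact criterion.

```python
# Illustrative sketch of dynamic (progressive) sampling, not the
# authors' exact procedure: grow the training sample geometrically
# and stop when validation accuracy (nearly) plateaus.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def dynamic_sample_size(X_train, y_train, X_val, y_val,
                        start=100, growth=2.0, eps=0.005, seed=0):
    """Return (n, accuracy) for a sample size judged 'probably close
    enough': enlarging the sample further improved validation accuracy
    by less than eps on the last step."""
    rng = np.random.default_rng(seed)
    n, prev_acc = start, -np.inf
    while True:
        n = min(n, len(X_train))
        idx = rng.choice(len(X_train), size=n, replace=False)
        model = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        acc = accuracy_score(y_val, model.predict(X_val))
        if acc - prev_acc < eps or n == len(X_train):
            return n, acc           # accuracy gain below tolerance
        prev_acc = acc
        n = int(n * growth)         # geometric sampling schedule
```

In contrast, a static test would pick the sample size from the data distribution alone; the sketch above instead couples the stopping decision to the observed learning curve of the specific mining tool, which is the distinction the paper draws.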