SIPping from the Data Firehose

George H. John
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120

Brian Lent
Stanford University
Department of Computer Science
Stanford, CA 94305

{gjohn,lent}@cs.stanford.edu
http://www-cs-students.stanford.edu/~{gjohn,lent}

When mining large databases, the data extraction problem and the interface between the database and the data mining algorithm become important issues. Rather than giving a mining algorithm full access to a database (by extracting to a flat file or other directly accessible data structure), we propose the SQL Interface Protocol (SIP), a framework for interaction between a mining algorithm and a database. The data continues to reside entirely within the database management system (DBMS), but the query interface to the database gives the data mining algorithm sufficient information to discover the same patterns it would have found with direct access to the data. This model of interaction brings several advantages; for example, it allows a mining algorithm to be parallelized automatically just by using a parallelized DBMS to answer queries. We show how two families of mining algorithms may be implemented as "SIPpers," and we discuss related work in databases that should further enhance performance in the future.

Citation: George H. John and Brian Lent. SIPping from the data firehose. In D. Heckerman, H. Mannila, and D. Pregibon, editors, _Third International Conference on Knowledge Discovery and Data Mining_, pages 199--202, Menlo Park, CA, 1997. AAAI Press.
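
As a rough illustration of the interaction the abstract describes, the sketch below has a mining routine ask the DBMS for aggregate counts through SQL instead of reading the rows itself. It is not the SIP protocol as the paper defines it. The use of SQLite, the table `weather`, its columns, and the helper names `class_counts` and `build_demo_db` are all assumptions made for the example.

```python
import sqlite3
from collections import defaultdict


def class_counts(conn, table, attribute, label):
    """Return {attribute value: {class label: count}} computed by the DBMS.

    Hypothetical helper: the database does the aggregation, so the caller
    only ever sees counts, never the raw rows.
    """
    # Identifiers cannot be bound as SQL parameters; this sketch assumes
    # the table and column names come from trusted code, not user input.
    query = (
        f'SELECT "{attribute}", "{label}", COUNT(*) '
        f'FROM "{table}" GROUP BY "{attribute}", "{label}"'
    )
    counts = defaultdict(dict)
    for value, cls, n in conn.execute(query):
        counts[value][cls] = n
    return dict(counts)


def build_demo_db():
    """Create a small in-memory table (made-up data) to query against."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weather (outlook TEXT, play TEXT)")
    rows = [
        ("sunny", "no"),
        ("sunny", "no"),
        ("sunny", "yes"),
        ("overcast", "yes"),
        ("rain", "yes"),
        ("rain", "no"),
    ]
    conn.executemany("INSERT INTO weather VALUES (?, ?)", rows)
    return conn


conn = build_demo_db()
counts = class_counts(conn, "weather", "outlook", "play")
```

A learner that needs only such counts (for example, to score a candidate split or estimate class-conditional frequencies) could be driven entirely by queries of this shape, which is why a parallel DBMS answering the same query would speed it up without changes to the mining code.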