The recent explosion of on-line information has given rise to a number of query-based search engines and manually constructed topic hierarchies. But with the current rate of growth in the amount of available information, query results grow incomprehensibly large and manual classification in topic hierarchies creates an immense information bottleneck. Therefore, these tools are rapidly becoming inadequate for addressing users' information needs.
We address these problems with a system for topical information space navigation that combines the query-based and taxonomic approaches. Our system, named SONIA (Service for Organizing Networked Information Autonomously), has been implemented as part of the Stanford Digital Libraries testbed. It enables the creation of dynamic hierarchical document categorizations based on the full text of articles. Using probability theory as a formal foundation, we have developed a number of machine learning methods that allow document collections to be automatically organized at a topical level. To generate such topical hierarchies, we employ a novel probabilistic clustering scheme that outperforms traditional methods used in both Information Retrieval and Probabilistic Reasoning. We have also developed methods for classifying new articles into such automatically generated hierarchies, or into existing manually generated ones. In contrast to standard classification approaches, which do not make use of the taxonomic relations in a topic hierarchy, our method makes explicit use of the existing hierarchical relationships between topics, leading to improvements in classification accuracy. Much of this improvement derives from the fact that each classification decision in such a hierarchy can be made by considering only the presence (or absence) of a small number of features (words) in each document. The relevant words are chosen using a novel information-theoretic algorithm for feature selection. We note that many of the components developed as part of SONIA are general enough that they have been successfully applied to data mining problems in domains entirely different from text.
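To make the idea of choosing a small set of discriminating words concrete, the following is a minimal sketch that ranks words by the mutual information between word presence and topic label. It is a generic, textbook-style ranking used here purely as an illustration and is not SONIA's feature selection algorithm, which is described in the paper cited on this page. The function names, the toy documents, and the topic labels are all assumptions made for the example.

```python
import math
from collections import Counter


def mutual_information(docs, labels, word):
    """Mutual information (in bits) between presence of `word` and the label.

    `docs` is a list of sets of words; `labels` is a parallel list of topics.
    Illustrative stand-in only, not the algorithm used in SONIA.
    """
    n = len(docs)
    # Joint counts over (word present?, label) and the two marginals.
    joint = Counter((word in doc, label) for doc, label in zip(docs, labels))
    present = Counter(word in doc for doc in docs)
    classes = Counter(labels)

    mi = 0.0
    for (is_present, label), count in joint.items():
        p_joint = count / n
        p_indep = (present[is_present] / n) * (classes[label] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


def select_features(docs, labels, k):
    """Return the k words with the highest mutual information with the label.

    Ties are broken alphabetically so the result is deterministic.
    """
    vocabulary = set().union(*docs)
    ranked = sorted(
        vocabulary,
        key=lambda w: (-mutual_information(docs, labels, w), w),
    )
    return ranked[:k]


# Hypothetical toy collection: each document is reduced to its set of words.
docs = [
    {"gene", "protein"},
    {"gene", "cell"},
    {"court", "law"},
    {"law", "judge"},
]
labels = ["bio", "bio", "legal", "legal"]

print(select_features(docs, labels, 2))
```

In a topic hierarchy, a selection step of this kind could be run separately at each internal node, using only the documents that reach that node, so that each local decision depends on a handful of words relevant to the distinction being made there.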
The integration of the hierarchical clustering and classification methods will allow large amounts of information to be organized and presented to a user in a comprehensible way, one which is tailored to his or her own particular needs. By alleviating the information bottleneck, we hope to provide users with a solution to the problems of information access on the Internet.
The latest implementation of SONIA includes an interface for managing hierarchies of documents using probabilistic clustering and classification methods. This interface is only accessible locally at Stanford.
In a previous incarnation, several of the clustering and classification components in SONIA were accessible through the SenseMaker interface. Through SenseMaker, users can query a number of information sources (including popular Web search engines and Dialog databases) and then use the option to bundle results by "Full-text" to employ SONIA for document organization.
The current system has been implemented by Mehran Sahami and Salim Yusufali.
Support for connecting the SenseMaker interface to SONIA was provided by Michelle Baldonado.
The project has benefitted greatly from discussions with Daphne Koller and Marti Hearst.
A paper providing a detailed description of SONIA (including an overview of the clustering, classification, and feature selection methods used in the system) is available at:
Sahami, M., Yusufali, S., and Baldonado, M. Q. W. 1998. SONIA: A Service for Organizing Networked Information Autonomously. To appear in Digital Libraries 98: Proceedings of the Third ACM Conference on Digital Libraries.
Papers related to more specific aspects of the system can be found below.