George H. John
Annotated Bibliography of Linear Discriminant Decision Trees
Posted to the comp.ai.neural-nets Newsgroup, May 1994
gjohn@cs.stanford.edu
ftp://starry.stanford.edu/pub/gjohn/dtbib.txt

I was thinking of writing up a tech report comparing all these things,
but since I don't know when I'll write it I'm casting this to the winds
of the USENET now.  Note: there's quite a lot of good stuff in this
article (long bibliography -- 30 refs), so I would like to ask you to
cite me if you use it.

>>>>> Tom Dietterich writes:
> The original CART book gave an algorithm for learning decision trees
> with linear decision boundaries at each decision node of the tree.
> See Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J.
> (1984). Classification and Regression Trees. Monterey, CA: Wadsworth
> and Brooks.

That excellent book should be required reading for anyone in machine
learning and neural nets.  However, the previous poster asked for the
original paper that proposed recursive partitioning classifiers with
linear discriminant functions as the device used for partitioning.  I
recently thought of such an algorithm myself and did a fairly extensive
review of the related literature.  I'll list the references I have,
including what I believe to be the first, with a brief comment on each.

THE BIBLIOGRAPHY

(These are ordered somewhat randomly, though some of the best are at
the top.)  I'd like to thank Richard Brent, whose excellent
bibliography started me on this whole crusade.

A very good paper; discusses all the important issues: missing values
(he suggests tokens -- see Quinlan's ID3 paper), multiclass
discrimination, minimizing a loss function instead of just the number
of misclassified points, and lookahead.  (Note he went on to co-author
the CART book.)

@Article{FriedmanLinearTrees,
  author  = "Jerome H. Friedman",
  title   = "A Recursive Partitioning Decision Rule for Nonparametric
             Classification",
  journal = "IEEE Transactions on Computers",
  year    = 1977,
  pages   = "404--408",
  month   = "April"
}

This is the earliest I know of.  They used linear splits, selecting the
split along the eigenvector of the covariance matrix with the largest
eigenvalue.  (I sketch this kind of split in code a bit further down.)

@Article{HenrichonNonparam,
  author  = {Ernest G. {Henrichon, Jr.} and King-Sun Fu},
  title   = "A Nonparametric Partitioning Procedure for Pattern
             Classification",
  journal = "IEEE Transactions on Computers",
  year    = 1969,
  volume  = "C-18",
  number  = 7,
  pages   = "614--624",
  month   = "July"
}

Uses k-means clustering to find splits.

@Article{FuKMeansDT,
  author  = "Y. K. Lin and K. S. Fu",
  title   = "Automatic Classification of Cervical Cells Using a Binary
             Tree Classifier",
  journal = "Pattern Recognition",
  year    = 1983,
  volume  = 16,
  number  = 1,
  pages   = "69--80"
}

Grab this software!  It's a nice paper describing a small extension to
CART that apparently improves it quite a bit.  (I'd like to see someone
compare their OC1 software with the actual CART software, though.)

@InProceedings{MurthyOC1,
  author   = "Sreerama Murthy and Simon Kasif and Steven Salzberg and
              Richard Beigel",
  title    = "{OC1}: Randomized Induction of Oblique Decision Trees",
  crossref = "AAAI93",
  pages    = "322--327"
}

@Misc{OC1software,
  author       = "Sreerama K. Murthy and Steven Salzberg and Simon Kasif",
  title        = "{OC1}",
  howpublished = "Available by anonymous ftp in {\tt blaze.cs.jhu.edu:pub/oc1}",
  year         = 1993
}
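Since so many of the papers above and below boil down to choosing a
hyperplane at each internal node, here's a toy sketch of the simplest
flavor -- the Henrichon & Fu style split along the direction of maximum
variance.  This is my own illustration, not the procedure from any of
the papers:

import numpy as np

# Toy "maximum variance direction" split at one tree node: project onto
# the principal eigenvector of the covariance matrix and split at the
# mean of the projections.  (Illustration only.)
def principal_axis_split(X):
    """Return (w, b) such that points with  w @ x > b  go to one child."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)            # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # cov is symmetric, so eigh
    w = eigvecs[:, np.argmax(eigvals)]       # direction of maximum variance
    b = float(np.mean(X @ w))                # threshold at the mean projection
    return w, b

# Example: two Gaussian clouds; the split direction picks up the axis
# along which they are separated.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(4.0, 1.0, (50, 2))])
w, b = principal_axis_split(X)
left, right = X[X @ w <= b], X[X @ w > b]
print(len(left), len(right))                 # roughly 50 / 50

The real algorithms then recurse on each side and stop according to
some criterion; the interesting differences are in how the hyperplane
is chosen and how the tree is pruned.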
An early ID3-lookalike (it doesn't use linear splits, but it's included
for historical significance).  The back of the book contains
instructions on how to get the program from them on punchcards and how
to feed it into your computer!

@Book{THAID,
  author    = "James N. Morgan and Robert C. Messenger",
  title     = "{THAID}: A Sequential Analysis Program for the Analysis
               of Nominal Scale Dependent Variables",
  publisher = "University of Michigan",
  year      = 1973
}

Many of these algorithms are actually decision trees/lists, although
they're presented as neural net algorithms.  Their objective function
has some undesirable properties and they tend to make unsupported
optimality claims, but I really like their approach.

@TechReport{MangasarianLPNNTR,
  author      = "Kristin P. Bennett and Olvi L. Mangasarian",
  title       = "Neural Network Training Via Linear Programming",
  institution = "Center for Parallel Optimization, Computer Science
                 Dept., University of Wisconsin",
  year        = 1990,
  number      = "948",
  address     = "Madison, WI",
  month       = "July"
}

@InProceedings{MangasarianLPNN,
  author    = "Kristin P. Bennett and Olvi L. Mangasarian",
  title     = "Neural Network Training Via Linear Programming",
  booktitle = "Advances in Optimization and Parallel Computing",
  year      = 1992,
  editor    = "P. M. Pardalos",
  pages     = "56--67",
  address   = "North Holland, Amsterdam"
}

@Article{MangasarianRobust,
  author  = "K. P. Bennett and O. L. Mangasarian",
  title   = "Robust Linear Programming Discrimination of Two Linearly
             Inseparable Sets",
  journal = "Optimization Methods and Software",
  year    = 1992,
  volume  = 1,
  pages   = "23--34"
}

Forms a decision tree using mean-squared error as the objective
function for the hyperplanes.

@InProceedings{SankarNTNVowel,
  author   = "Ananth Sankar and Richard J. Mammone",
  title    = "Speaker Independent Vowel Recognition using Neural Tree
              Networks",
  crossref = "IJCNN91-SEATTLE",
  pages    = "II: 809--814"
}

@InProceedings{SankarNTNPrune,
  author   = "Ananth Sankar and Richard J. Mammone",
  title    = "Optimal Pruning of Neural Tree Networks for Improved
              Generalization",
  crossref = "IJCNN91-SEATTLE",
  pages    = "II: 219--224"
}

There's a forthcoming article (or has it already come forth?) in the
Machine Learning journal by Carla Brodley which seems to repeat much of
Sankar's work.  Both of them even tried out Frean's "Thermal
Perceptrons" to learn their trees faster!  It's a shame that "ML"
people don't read Neural Net papers...

Presents a decision tree with hyperplane splits; the authors claim
their algorithm is better than CART because it's faster.  Read the
commentary by Breiman and Friedman for a fun time!

@Article{LohTreeStructured,
  author  = "Wei-Yin Loh and Nunta Vanichsetakul",
  title   = "Tree-Structured Classification Via Generalized
             Discriminant Analysis",
  journal = "Journal of the American Statistical Association",
  year    = 1988,
  volume  = 83,
  number  = 403,
  pages   = "715--725",
  month   = "September"
}

@Misc{CARTOnLoh,
  author  = "Leo Breiman and Jerome H. Friedman",
  title   = "Comment on {\it Tree-Structured Classification...}",
  journal = "Journal of the American Statistical Association",
  year    = 1988,
  volume  = 83,
  number  = 403,
  pages   = "715--725",
  month   = "September"
}

Finds hyperplanes, based on regression.

@Article{FuBinaryTree,
  author  = "Shi Qing-Yun and K. S. Fu",
  title   = "A Method for the Design of Binary-Tree Classifiers",
  journal = "Pattern Recognition",
  year    = 1983,
  volume  = 16,
  number  = 6,
  pages   = "593--603"
}

Uses the EM approach to find an optimal tree, eschewing the common
greedy tree-building approach.

@InProceedings{JordanHierarchiesML,
  author   = "Michael I. Jordan and Robert A. Jacobs",
  title    = "Supervised Learning and Divide-and-Conquer: A Statistical
              Approach",
  crossref = "ML93"
}
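Backing up a bit: several of the entries above (Sankar & Mammone's
neural tree networks, Qing-Yun & Fu) fit the splitting hyperplane at a
node by regression or a mean-squared-error criterion.  In its very
simplest form that might look like the sketch below -- my own toy
version, with two classes coded as +1/-1; the actual papers differ in
the details:

import numpy as np

# Toy regression-based hyperplane split at one tree node: regress +/-1
# class labels on the features by least squares and split on the sign
# of the fitted linear function.  (Illustration only.)
def least_squares_hyperplane(X, y):
    """X: (n, d) features; y: (n,) labels in {-1, +1}.  Returns (w, b)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs[:-1], coeffs[-1]             # weights, bias

def split_node(X, w, b):
    """Route the points at this node to the two children."""
    side = X @ w + b > 0
    return X[side], X[~side]

The tree-building papers differ mainly in the objective (MSE versus an
impurity measure), in how the hyperplane is refined, and in when to
stop splitting.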
Discusses the mapping between trees and neural nets.  Gives an
algorithm for finding linear splits based on a measure somehow related
to entropy, which I still don't understand.  Used conjugate gradient as
the optimizer.

@Article{BrentTNN91,
  author  = "Richard P. Brent",
  title   = "Fast Training Algorithms for Neural Networks",
  journal = "IEEE Transactions on Neural Networks",
  year    = 1991,
  volume  = 2,
  number  = 3,
  pages   = "346--354",
  month   = "May"
}

Uses Littlestone's Winnow to learn quickly with lots of irrelevant
features.  Uses the number of errors as the splitting criterion to be
minimized.

@InProceedings{Sahami93,
  author   = "Mehran Sahami",
  title    = "Learning Non-Linearly Separable Boolean Functions with
              Linear Threshold Unit Trees and Madaline-Style Networks",
  crossref = "AAAI93"
}

Nice review of splitting criteria, but the results (that one can use a
random splitting rule and do just as well as with entropy) are
incorrect, being based on a serious methodological flaw: the datasets
used for pruning and for final tree evaluation were the same.

@Article{MingersSplitCriteria,
  author  = "John Mingers",
  title   = "An Empirical Comparison of Selection Measures for
             Decision-Tree Induction",
  journal = "Machine Learning",
  year    = 1989,
  volume  = 3,
  pages   = "319--342"
}

General decision tree paper; shows that Mingers was wrong -- the
splitting criterion is important after all.  Not enough people have
read this paper!!  (In particular, a reviewer rejected a paper of mine
on the basis of Mingers' results, which these guys show were based on
serious methodological errors.)

@Article{BuntineSplitCriteria,
  author  = "Wray Buntine and Tim Niblett",
  title   = "A Further Comparison of Splitting Rules for Decision-Tree
             Induction",
  journal = "Machine Learning",
  year    = 1992,
  volume  = 8,
  pages   = "75--85"
}

This is essentially the ID3 algorithm, published a few years before
Quinlan's 1986 version.  To be fair, it deals only with real-valued,
not discrete, variables.  (But discrete variables were dealt with in
THAID.)

@Article{SethiID3,
  author  = "I. K. Sethi and G. P. R. Sarvarayudu",
  title   = "Hierarchical Classifier Design Using Mutual Information",
  journal = "IEEE Transactions on Pattern Analysis and Machine
             Intelligence",
  year    = 1982,
  volume  = "PAMI-4",
  number  = 4,
  pages   = "441--445",
  month   = "July"
}

An EXCELLENT review of decision trees.  Yes, it's old, but don't let
that stop you.

@Article{Moret82,
  author  = "Bernard M. Moret",
  title   = "Decision Trees and Diagrams",
  journal = "ACM Computing Surveys",
  year    = 1982,
  volume  = 14,
  number  = 4,
  pages   = "593--623"
}

Another ID3-lookalike.

@InProceedings{TalmonDT,
  author    = "Jan L. Talmon",
  title     = "A Multiclass Nonparametric Partitioning Algorithm",
  booktitle = "Pattern Recognition in Practice II",
  year      = 1986,
  editor    = "E. S. Gelsema and L. N. Kanal",
  pages     = "449--459",
  publisher = "Elsevier Science Publishers"
}

Some bad news for us decision-tree builders...

@Article{RivestDTNP,
  author  = "Laurent Hyafil and Ronald L. Rivest",
  title   = "Constructing Optimal Binary Decision Trees is NP-Complete",
  journal = "Information Processing Letters",
  year    = 1976,
  volume  = 5,
  number  = 1,
  pages   = "15--17",
  month   = "May"
}

Ah, we are saved!  But this algorithm is only practical for small
datasets -- Hyafil and Rivest weren't wrong.

@Article{PayneOptDT,
  author  = "Harold J. Payne and William S. Meisel",
  title   = "An Algorithm for Constructing Optimal Binary Decision
             Trees",
  journal = "IEEE Transactions on Computers",
  year    = 1977,
  volume  = "C-26",
  number  = 9,
  pages   = "905--916",
  month   = "September"
}
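A quick aside for anyone who hasn't seen the entropy criterion that the
Mingers and Buntine & Niblett papers above argue about: here is the
standard information-gain computation for a single binary split.  This
is the generic ID3-style measure, not any one paper's exact version:

import numpy as np

def entropy(labels):
    """Entropy (in bits) of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(y, left_mask):
    """Gain from splitting the label vector y by the boolean mask left_mask."""
    y = np.asarray(y)
    left_mask = np.asarray(left_mask, dtype=bool)
    n, n_left = len(y), int(left_mask.sum())
    if n_left in (0, n):                     # degenerate split: no gain
        return 0.0
    y_left, y_right = y[left_mask], y[~left_mask]
    return (entropy(y)
            - (n_left / n) * entropy(y_left)
            - ((n - n_left) / n) * entropy(y_right))

A linear-split tree builder just searches over hyperplanes (w, b) for
the one maximizing this gain (or minimizing some other impurity) on the
data at the node.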
Meisel", title = "An Algorithm for Constructing Optimal Binary Decision Trees", journal = "IEEE Transactions on Computers", year = 1977, volume = "C-26", number = 9, pages = "905--916", month = "September" } This uses fishers' linear discriminant to partition the data, which is what Friedman did in 1977. They don't cite him :( @InProceedings{KoutsougerasEntropyNet, author = "C. Koutsougeras and C. A. Papachristou", title = "Training of a Neural Network for Pattern Classification Based on an Entropy Measure", crossref = "IEEENN88", pages = "247--254" } Their hyperplanes are trained using the perceptron error correction rule. Incremental. @Article{AdaptiveLinearFilter, author = "S. B. Gelfand and C. S. Ravishankar", title = "A Tree-structured Piecewise-Linear Adaptive Filter", journal = "IEEE Transactions on Information Theory", year = 1993, volume = 39, number = 6, pages = "1907--1922", month = "November" } I think these were the first to discuss the mapping between decision trees and neural nets. Their splitting functions were univariate, I think. @InProceedings{SethiEntropyNets, author = "Ishwar K. Sethi and Mike Otten", title = "Comparison between entropy nets and decision trees", crossref = "IJCNN90", volume = 3, pages = "63--68" } @Article{SethiEntropyNetsIEEE, author = "Ishwar K. Sethi", title = "Entropy Nets: from Decision Trees to Neural Networks", journal = "Proceedings of the IEEE", year = 1990, volume = 78, number = 10, pages = "1605--1613", month = "October" } Their hearts seemed in the right place but their actual algorithm didn't make much sense to me... @Article{BichselEntropy, author = "Martin Bichsel and Peter Seitz", title = "Minimum Class Entropy: A Maximum Information Approach to Layered Networks", journal = "Neural Networks", year = 1989, volume = 2, pages = "133-141" } This seemed quite similar to Friedman's 1977 article, I'm not sure how it extended those results. @Article{RoundsDT, author = "E. M. Rounds", title = "A Combined Nonparametric Approach to Feature Selection and binary decision tree design", journal = "Pattern Recognition", year = 1980, volume = 12, pages = "313--317" } Actually only uses perceptrons at the leaves, where an MSE criterion is appropriate @InProceedings{UtgoffPerceptronTreesAAAI, author = "Paul E. Utgoff", title = "Perceptron Trees: A Case Study in Hybrid Concept Representations", crossref = "AAAI88", pages = "601--606" } Historical book on decision trees. @Book{HuntCLS, author = "Earl B. Hunt and Janet Marin and Philip J. Stone", title = "Experiments in Induction", publisher = "Academic Press", year = 1966, address = "New York" } There was some really old work on decision trees (before computers were fast enough to handle real datasets) that was implemented on a card sorter -- you punch each instance in your training set onto a card, and you feed this set through a card sorter which sorted them into two piles, you then feed one of the subsets through and it sorts that into 2 piles and so on. So it ends up building a decision tree but you have to keep track of all this yourself. I've lost my reference to this work -- if anyone remembers who did it I'd be happy to hear from you.