Dan Steinberg's Blog

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.
Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full-day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

Displaying and Saving Numerical Results Precisely

Salford predictive modeling engines use high-precision algorithms to compute essential results, but printed reports and the GUI may display those results with less precision for convenience. In some circumstances, however, you will need to pay careful attention to this and insist that the data mining tool print, display, and save results at the highest useful precision.
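
To illustrate the general point about displayed versus stored precision, here is a minimal sketch in plain Python. It is not Salford SPM code or output, and the helper names `format_for_display` and `format_for_storage` are hypothetical; it only shows that a value rounded for display no longer reproduces the number that was computed.

```python
# Illustrative sketch in plain Python; not Salford SPM code or output.

def format_for_display(x, digits=4):
    """Round a value to a fixed number of decimals for a report or GUI."""
    return f"{x:.{digits}f}"

def format_for_storage(x):
    """Return the shortest string that converts back to exactly the same float."""
    return repr(x)

value = 2.0 / 3.0
shown = format_for_display(value)   # rounded: convenient to read
saved = format_for_storage(value)   # full precision: safe to reload

# The saved string reproduces the float exactly; the displayed one does not.
print(shown, float(shown) == value)
print(saved, float(saved) == value)
```

If results are to be reused downstream (for example, split values or scores), saving the full-precision form avoids small discrepancies that rounding can introduce.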

Continue Reading

CART® vs. The Clones

At Salford Systems we are frequently asked what the difference is between the trademarked decision tree CART® and the various clones that have been created by other companies or contributed as user-written packages to community-oriented systems. Our website contains a variety of essays and FAQs on this matter, and we link to them below. But here is a very brief summary of the details:

The original and true CART was written entirely by Stanford University Professor Jerome H. Friedman, and has always been proprietary source code available only to Salford Systems. Friedman is one of the inventors of CART and widely regarded as one of the most influential and important researchers in data mining. He is also considered one of the world's best algorithm writers and scientific programmers. In other words, we offer the only true CART written by a creator of this revolutionary technology. It contains everything discussed in the original CART monograph and much more that was not touched upon in the book.

Continue Reading

Rules of Thumb When Working With Small Data Samples


The original CART monograph discusses a study the authors performed working with 215 observations and 19 predictors, where 37 records were of class 1 and 178 of class 0. We think that this example, with 37 records in the smaller class, is close to the smallest sample size you can usefully work with in CART.

Recommendation: We suggest using a minimum of 100 records, with the target variable no more unbalanced than proportions of (1/3, 2/3), for up to 30 predictors. We recommend repeated cross-validation to estimate out-of-sample performance (that is, performance on previously unseen data).
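
As a rough sketch of what repeated cross-validation involves, the plain-Python generator below reshuffles the data several times and, on each repeat, deals the records of every class across the folds so that each test fold keeps roughly the original class balance. The function name `repeated_stratified_folds`, the choice of 5 folds and 3 repeats, and the stratification itself are assumptions made for illustration; this is not how CART or SPM implements it. The class counts (37 and 178) are taken from the monograph example mentioned above.

```python
import random
from collections import defaultdict

def repeated_stratified_folds(labels, n_folds=5, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) tuples.

    On every repeat the records of each class are reshuffled and dealt
    round-robin into the folds, so each fold gets a similar class mix.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    for rep in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        for y in sorted(by_class):
            idx = by_class[y][:]
            rng.shuffle(idx)
            for pos, i in enumerate(idx):
                folds[pos % n_folds].append(i)
        for k in range(n_folds):
            test = sorted(folds[k])
            train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
            yield rep, k, train, test

# 215 records: 37 of class 1 and 178 of class 0, as in the monograph example.
labels = [1] * 37 + [0] * 178
splits = list(repeated_stratified_folds(labels))
print(len(splits), "train/test splits")
```

A model would be fit on each `train` set and scored on the matching `test` set; averaging the scores over all folds and repeats gives a steadier estimate than a single cross-validation run, which matters most with small samples.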

Continue Reading

CART Tree Cloning: FORCING Identical Structure Across Different Targets

CART 7.0, an integral part of Salford Systems Predictive Modeler SPM 7.0, offers a new feature to essentially clone a tree or subtree structure and impose it on any target variable you choose. This means, for example, that you can grow a CART tree on dependent variable Y1, optionally prune the tree judgmentally, and then extract the entire structure of the tree and force it onto a new target variable Y2. The second tree will exhibit the identical structure to the first, but it will be "about" Y2 rather than Y1.

Of course, CART users have always had the option of growing a tree on target variable Y1 to create a segmentation of data. Predictions for any variable at all can then be made segment by segment by simply noting the mean values of other variables, such as, for example, Y2, and these could be taken as CART predictions for the other variables. So what are the advantages of the new tree cloning feature?
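
The segment-by-segment approach described above can be sketched in a few lines of plain Python. This assumes each record has already been assigned a segment (terminal node) by a tree grown on Y1; the segment IDs and Y2 values below are made up for illustration, and `segment_means` is a hypothetical helper, not an SPM function.

```python
from collections import defaultdict

def segment_means(segment_ids, values):
    """Return the mean of `values` within each segment."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seg, v in zip(segment_ids, values):
        totals[seg] += v
        counts[seg] += 1
    return {seg: totals[seg] / counts[seg] for seg in totals}

# Hypothetical data: the segment each record falls in (from a tree grown
# on Y1) and that record's value of a different variable, Y2.
segments = [1, 1, 2, 2, 2, 3]
y2 = [10.0, 14.0, 3.0, 5.0, 4.0, 8.0]

means = segment_means(segments, y2)
# The prediction for Y2 is simply the mean of Y2 in the record's segment.
predictions = [means[seg] for seg in segments]
print(means)
```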

Continue Reading

Does CART allow multiple targets?

There are two ways to interpret your question:

Does CART® allow multi–class targets (e.g., a class label with values 1, 2, 3, etc.)?

CART has been used in real world classification problems with more than 400 classes.

In one project our goal was to predict which specific model of new car a given person actually bought. In the project there were more than 400 different car models available and the predictors were drawn from a lengthy set of attitude and interest questions.

For such models to be useful you need to have a decent sample size for each level of the target. In the car purchase study some models had been bought by more than 2000 people (a good sample size) while some exotic and expensive cars had been bought by fewer than 10 people (the total sample size was over 50,000 records). Naturally, we could not place much faith in the predictions concerning the least frequently bought cars. However, overall, the models built were both quite accurate and generated considerable insight into the factors influencing consumer choice in car purchases.
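
One simple way to check the per-class sample sizes before modeling is to count records per target level and flag the thinly populated ones. The sketch below uses plain Python; the class names, counts, and the cutoff of 30 are made-up values for illustration, not figures from the car purchase study, and `rare_classes` is a hypothetical helper.

```python
from collections import Counter

def rare_classes(labels, min_count):
    """Return the target levels that have fewer than `min_count` records."""
    counts = Counter(labels)
    return sorted(c for c, n in counts.items() if n < min_count)

# Made-up target column: two well-populated levels and one rare level.
labels = ["sedan"] * 2000 + ["suv"] * 500 + ["exotic"] * 8

# The cutoff is an assumption; choose one appropriate to your problem.
flagged = rare_classes(labels, min_count=30)
print(flagged)
```

Predictions for the flagged levels deserve less trust, even when the model as a whole performs well.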

Continue Reading

A Few Comments On Boosting Decision Trees

Boosting is a machine learning strategy that came into being shortly after researchers discovered the value of "ensembles." Ensembles are collections of models which are used as a group to make predictions (and classifications) that are often considerably more accurate than those of individual models. The models are combined either by averaging predictions or by using a voting scheme (for classification). Thus, if we built 101 classification models where the output of each model is a prediction of "YES" or "NO", then the ensemble prediction might follow a majority vote rule: predict "YES" for any record that obtains at least 51 "YES" votes, and predict "NO" otherwise. Some ensemble methods use weighted voting, where the weights reflect the predictive accuracy of the individual models. In this post we want to focus on a few key ideas related to Salford products rather than the scientific field (we will do that in another post or paper).
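
The two voting rules just described can be sketched as follows in plain Python. The function names and the example weights are assumptions made for illustration; they are not taken from any Salford product.

```python
def majority_vote(votes):
    """Predict "YES" if more than half of the models vote "YES"."""
    needed = len(votes) // 2 + 1          # 51 when there are 101 models
    yes = sum(1 for v in votes if v == "YES")
    return "YES" if yes >= needed else "NO"

def weighted_vote(votes, weights):
    """Predict "YES" if the "YES" votes carry more than half the total weight."""
    yes_weight = sum(w for v, w in zip(votes, weights) if v == "YES")
    return "YES" if yes_weight > sum(weights) / 2 else "NO"

# 101 models: 51 "YES" votes are just enough to carry the ensemble.
print(majority_vote(["YES"] * 51 + ["NO"] * 50))
print(majority_vote(["YES"] * 50 + ["NO"] * 51))

# With weights, one accurate model can outvote two weaker ones.
print(weighted_vote(["YES", "NO", "NO"], [0.9, 0.3, 0.3]))
```

Using an odd number of models, as in the 101-model example, avoids ties under the simple majority rule.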

Continue Reading

How many levels can a target variable have in CART® and other SPM data mining engines?

The Salford CART decision tree is exceptional in supporting an essentially unlimited number of target levels. Of course the vast majority of classification problems tackled by analysts have two classes, or are reformulated to have two classes. There is no reason, however, to confine yourself to just two levels if you are working with CART.

In our training materials we discuss three–level, five–level, and ten–level examples in detail. The ten–level example concerns the reverse engineering of a clustering solution, in which a market researcher was looking to extract a simple set of rules that could be used to assign new records to a previously constructed clustering solution based on a very large number of variables.

Ten levels is a rather small number when considering how far you might be able to stretch the CART machinery. In our work with a car manufacturer our goal was to predict the specific car model chosen by a new car buyer from a set of more than 400 alternatives. The analysis was based on survey responses to several hundred attitude and preference questions administered to more than 20,000 new car buyers, and the results yielded extraordinary insight into the needs and wants driving ultimate car model selection. In our own internal testing of CART classification based on synthetic data, we have successfully run CART models on targets with 1,000 levels.

Continue Reading

E-mail Conversation with Leo Breiman on using Out of Bag (OOB) data for pruning Bagged Trees

In 1995 Leo Breiman was actively experimenting with his first version of the bagger, and at that time I was in constant contact with him via email. In some cases at Salford Systems we implemented ideas of Leo's as we were discussing them with him. At other times we debated certain details and exchanged ideas in a lively give and take. Leo's initial ideas always took as a given that the bagged trees needed to be pruned, and he was using 10–fold cross-validation to do so. Because this added a substantial computational burden to the process, I suggested that he use the OOB (out of bag) data to test and prune each bagged tree. In response, Leo began experimenting with this idea and eventually concluded that the entire training sample (both in–bag and out of bag) should be used to prune each bagged tree. Of course, subsequent research showed that unpruned trees were in fact ideal, and thus the topic of using OOB data for pruning trees fell by the wayside. OOB data became very important in Leo's subsequent work on RandomForests four years later.

The emails here are a selection of messages I received from Leo in mid–1995 on the topic. Unfortunately, we do not appear to have any copies of my side of the conversation. We hope to post other messages from Leo here from time to time as his remarks covered a very broad range of topics pertaining to trees and data mining.
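
For readers unfamiliar with the in-bag and out-of-bag terminology used above, here is a minimal plain-Python sketch of how one bootstrap sample divides a training set. It is a generic illustration, not Leo's code or the Salford implementation, and `bootstrap_split` is a hypothetical helper name.

```python
import random

def bootstrap_split(n, seed=0):
    """Draw one bootstrap sample of size n (with replacement).

    Returns the in-bag indices (duplicates allowed) and the out-of-bag
    indices, i.e. the records that were never drawn.
    """
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    oob = [i for i in range(n) if i not in drawn]
    return in_bag, oob

n = 1000
in_bag, oob = bootstrap_split(n)
# For large n, roughly 1/e (about 37%) of the records end up out of bag.
print(len(oob) / n)
```

Because the out-of-bag records played no part in growing a given tree, they can serve as test data for that tree without a separate cross-validation pass.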

Continue Reading