Dan Steinberg's Blog

Which Data Mining, Predictive Modeling Engine is Best for Me?

We are often asked “Which analytical technology is best for my problem?” This topic not only comes up in practical, day-to-day modeling, but has also been the subject of a few (largely disappointing) academic studies. The short answer we usually give is that for many modeling problems it doesn’t require much time to run several different analyses, so why not rely on experimentation rather than some rule of thumb? If one method stands out for any reason, such as accuracy or intuitive attractiveness of the model, then you have your answer.

Over the course of more than a decade of intensive modeling experience, however, we have observed a few patterns that are worth discussing. This is the first in a series of items that will focus on the choice of data mining engine.

Start with CART!

CART is a technology that can work with any mix of data types, is resistant to outliers, and is powerfully adept at dealing with missing values. You can run a CART model well before you would sensibly dare to run a conventional statistical model. Even if the data are in terrible shape, you will still probably learn something of value. We often find that the CART model reveals a variety of surprises, including:

  • Overall predictability of your target (dependent) variable.

    • Naturally, we are always suspicious of results that are too good. When the results are too good to be true, we look for predictors that should be excluded because they represent information that would never be available to us in a real-world forecasting situation.

    • We also ask questions about the data if the results are really poor. Just this week on some client data we found an initial R-Squared of 0.00 on a particular model. Further investigation revealed major differences between the training data and the test data.

  • Clone variables. When one of the clones is a primary splitter in the tree, all the other clones will show up as surrogates. Perfect clones come with an Association score of 1.00. CART will recognize clones even when one is a nonlinear (order-preserving) transform of the other, whereas a linear correlation coefficient will not.

  • Non-credible results. Although the root node splitter in a CART tree is not necessarily the most important predictor in the model, it is usually a rather well-known and well-understood driver of the model. So what do you do if the root node splitter is something unexpected, a variable that domain experts do not generally consider very interesting? In one project, just such a strange splitter allowed us to uncover a major data processing error that had contaminated that variable with future information.

  • Impossible patterns in the tree. It pays to study the decision logic in the top few levels of the CART tree. Does the flow make sense to a domain expert? Sometimes the tree appears to be describing an impossible situation. We encountered such an example in e-commerce web log data for which we were trying to predict BUY/NO BUY behavior. The root node of the tree asked whether the web site visitor had registered on the site. Not surprisingly, people who register on a site are far more likely to make purchases on that site than anonymous visitors. Among those who registered, the tree next asked where the visitor lived. This information was provided by many of the registrants and thus was generally available on this side of the tree. Turning to the side of the tree devoted to those who had NOT registered, we again saw CART asking where the visitor lived. But for non-registrants, this information is generally not available. So why was CART trying to leverage this information?

    The answer turned out to be embarrassingly simple: the REGISTER/NO REGISTER flag in the client’s database was inaccurate, and a large number of registrants had been flagged as not-registered, even though any information provided by the visitor while filling out the registration forms was correctly captured. So when CART turned to the largely non-registered partition of the data it found that it had some useful information that could be leveraged for prediction.

    The database error that we discovered with the CART tree was fundamental to the client’s business and the error had been missed by many others who had worked with these data. By starting with a simple CART tree, we uncovered the problem in our first few days of inspecting the data.
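Once a tree raises a suspicion like the one in the registration example, a direct check on the raw rows can confirm it. The following is a minimal sketch in plain Python, not the procedure we used on the client's data. The field names (`registered`, `city`, `visitor_id`) and the toy rows are hypothetical and exist only for illustration.

```python
def find_flag_conflicts(rows, flag_key="registered", detail_keys=("city",)):
    """Return rows flagged as NOT registered that still carry details
    which, in this hypothetical schema, only a registration form collects."""
    conflicts = []
    for row in rows:
        has_details = any(row.get(key) not in (None, "") for key in detail_keys)
        if not row.get(flag_key) and has_details:
            conflicts.append(row)
    return conflicts


# Made-up toy rows; the column names are assumptions, not the client's schema.
visits = [
    {"visitor_id": 1, "registered": True, "city": "San Diego"},
    {"visitor_id": 2, "registered": False, "city": None},
    {"visitor_id": 3, "registered": False, "city": "Boston"},
    {"visitor_id": 4, "registered": False, "city": ""},
]

conflicts = find_flag_conflicts(visits)
print([row["visitor_id"] for row in conflicts])
```

Any row returned here is "not registered" according to the flag yet carries registration-only information, which is exactly the contradiction the tree exposed.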

These examples should make it plain that there is much to gain and really nothing to lose by starting with CART. You do need to become comfortable with reading the tree and its component reports, but once you do, the payoff should come quickly. In our own consulting work we cannot recall a single real-world data set that did not have at least one serious problem that we were able to discover with simple CART trees.
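One of the component reports worth learning to read is the surrogate list, since that is where clone variables show up. As a rough illustration of why a tree catches clones that correlation misses, the sketch below compares a Pearson correlation with an association-style score for a candidate surrogate. The score used here (the relative reduction in misrouted rows compared with always sending rows to the larger side of the primary split) is our assumption about a reasonable definition; it is not taken from the CART software, and the data are synthetic.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_association(goes_left, candidate):
    """Best association-style score over all single thresholds on `candidate`.

    goes_left[i] is True when row i goes left under the primary split.
    A score of 1.0 means some threshold on the candidate reproduces the
    primary split exactly; 0.0 means it does no better than the default
    rule of sending every row to the larger side.
    """
    n = len(goes_left)
    p_left = sum(goes_left) / n
    default_error = min(p_left, 1 - p_left)
    if default_error == 0:
        raise ValueError("primary split must send rows to both sides")
    best = 0.0
    for threshold in sorted(set(candidate)):
        agree = sum(
            (value <= threshold) == left
            for value, left in zip(candidate, goes_left)
        ) / n
        agree = max(agree, 1 - agree)  # allow the reversed direction too
        best = max(best, (default_error - (1 - agree)) / default_error)
    return best


# Synthetic data: `clone` is a nonlinear, order-preserving transform of x.
x = [float(v) for v in range(1, 21)]
clone = [math.exp(v / 2) for v in x]
unrelated = [float((7 * v) % 11) for v in range(1, 21)]
goes_left = [v <= 8.5 for v in x]  # hypothetical primary split on x

print("Pearson(x, clone):     ", round(pearson(x, clone), 2))
print("Association, clone:    ", best_association(goes_left, clone))
print("Association, unrelated:", best_association(goes_left, unrelated))
```

The clone earns a perfect association because a threshold on it routes every row exactly as the primary split does, while its Pearson correlation with the original variable stays well below 1.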

