Dan Steinberg's Blog

Accurate results with limited data in CART and TreeNet

In our example we obtain surprisingly good results on holdout data using an unusually small sample.

For the sake of the example we started with a dataset in our archives containing 235,580 records recording the outcome of a binary target variable, representing a GOOD or a BAD outcome for a financial services organization, where the BAD rate was about 13.25%. (The extract discussed here has been modified to disguise the actual real-world patterns in it, including the BAD rate; however, the levels of predictability shown here do reflect the real world.) While the original data had several hundred available predictors, we selected 45 to work with (any variable names below have been falsified but still reflect the general nature of the original variables).

We started by reserving 64,889 records for the set-aside or holdout sample, which would not be used in any way in model construction or selection. This sample is large enough to plausibly stand in for the "truth". The remaining data could be partitioned in any way between learn and test roles, and we experimented with many such divisions. In this example we happened to have selected 41,712 records for the learn set and left the remaining 128,979 records for test.
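As a sketch, this three-way split can be reproduced index-wise with NumPy; the partition counts are the article's, while the seed is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary illustrative seed

# Shuffle all 235,580 record indices, then carve off the three partitions.
n_total = 235_580
idx = rng.permutation(n_total)

holdout_idx = idx[:64_889]              # set aside first, never touched
learn_idx = idx[64_889:64_889 + 41_712]
test_idx = idx[64_889 + 41_712:]        # the remaining 128,979 records

print(len(holdout_idx), len(learn_idx), len(test_idx))
```

Setting the holdout aside before any modeling begins is what lets it stand in for the "truth" later.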


Our first TreeNet model used all the learn data and yielded these performance results:

    Partition   ROC      Total Learn Sample
    Test        .93389   41,712
    Holdout     .93326

The excellent agreement between the Test and Holdout partition results is typical of TreeNet models. For the sake of argument, let us assume that this is the best we can do with this data and this category of learning machine.
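TreeNet is an implementation of stochastic gradient boosting, so a rough open-source analogue of this first experiment can be sketched with scikit-learn's GradientBoostingClassifier. The synthetic data, partition sizes, and seeds below are purely illustrative stand-ins for the proprietary extract, so the ROC values will not match the table above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Illustrative synthetic data standing in for the proprietary extract:
# 45 predictors, roughly a 13.25% "BAD" (class 1) rate.
X, y = make_classification(n_samples=10_000, n_features=45,
                           n_informative=10, weights=[0.8675],
                           random_state=0)

# Learn / test / holdout split in the spirit of the article's design.
X_learn, y_learn = X[:4_000], y[:4_000]
X_test, y_test = X[4_000:7_000], y[4_000:7_000]
X_hold, y_hold = X[7_000:], y[7_000:]

# Stochastic gradient boosting; subsample < 1 gives the "stochastic" part.
model = GradientBoostingClassifier(subsample=0.5, random_state=0)
model.fit(X_learn, y_learn)

aucs = {name: roc_auc_score(yp, model.predict_proba(Xp)[:, 1])
        for name, Xp, yp in [("test", X_test, y_test),
                             ("holdout", X_hold, y_hold)]}
print(aucs)  # test and holdout ROC should agree closely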

In a series of experiments we progressively reduced the learn sample in order to observe the expected drop in Test and Holdout ROC performance. In the first experiment we shrank the learn sample simply by randomly deleting records from the non-default group (GOODs), arriving at roughly equal numbers of GOODs and BADs (the counts were 14,566 and 12,623). We expected to see a noticeable drop in performance but instead observed:

    Partition   ROC      Total Learn Sample
    Test        .93338   27,189
    Holdout     .93265
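The GOOD-only deletion just described amounts to downsampling one class. A minimal NumPy sketch, with an assumed 0 = GOOD / 1 = BAD coding and the article's counts (only GOODs were deleted, so the 12,623 BADs are the learn partition's original BAD count):

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary illustrative seed

# Assumed coding: 0 = GOOD, 1 = BAD. The learn partition held
# 12,623 BADs; the rest of its 41,712 records were GOODs.
y = np.array([0] * (41_712 - 12_623) + [1] * 12_623)

good_idx = np.flatnonzero(y == 0)
bad_idx = np.flatnonzero(y == 1)

# Randomly delete GOODs, keeping 14,566 to roughly balance the classes.
kept_goods = rng.choice(good_idx, size=14_566, replace=False)
balanced_idx = np.concatenate([kept_goods, bad_idx])

print(len(balanced_idx))  # 27189, matching the table above
```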

This was a surprise, because previous experiments had suggested that there is usually a substantial benefit to increasing the good:bad ratio in such models well above 1:1. We next looked at reductions in the size of the overall learn partition while keeping the good:bad ratio roughly constant. We reduced the partition dramatically, by more than 80%, leaving us with just 4,619 learn records roughly evenly divided between GOODs and BADs; the test and holdout partitions were left unchanged in every experiment. This reduced learn partition allowed us to reach a model with:

    Partition   ROC      Total Learn Sample
    Test        .92001   4,619
    Holdout     .91959

Most researchers would consider a training sample of 4,619 to be on the small side and far less desirable than 27,189. Nonetheless, the drop in performance on our previously unseen data is surprisingly modest (about 1.4%).

Intrigued, we lowered the learn partition size once again, this time to just 1,516 records (756 class 0, 760 class 1) to obtain:

    Partition   ROC      Total Learn Sample
    Test        .91273   1,516
    Holdout     .91349

Taking the process yet another step further, we cut the learn sample down to a mere 759 records (384 class 0, 375 class 1).

    Partition   ROC      Total Learn Sample
    Test        .90472   759
    Holdout     .90415

Again, recall that the test and holdout partitions are unchanged at 128,979 and 64,889 records. These partitions are large enough to leave little room for doubt about the reliability of the results. In one last attempt to disadvantage the model we dropped the learn partition size once more, to just 369 records (185 class 0, 184 class 1). Surely this is pushing the limits of a suitable training set! Here we obtained:

    Partition   ROC      Total Learn Sample
    Test        .88895   369
    Holdout     .89027

This represents less than a 5% reduction in area under the ROC curve in response to a more than 99% reduction in sample size.
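The two percentages quoted above can be verified with a couple of lines of arithmetic taken straight from the tables:

```python
# Relative drop in test-partition ROC, full learn sample vs. N=369.
roc_full, roc_small = 0.93389, 0.88895
roc_drop = (roc_full - roc_small) / roc_full

# Relative reduction in learn sample size, 41,712 vs. 369 records.
n_full, n_small = 41_712, 369
size_drop = (n_full - n_small) / n_full

print(f"ROC drop: {roc_drop:.1%}, sample reduction: {size_drop:.1%}")
# ROC drop: 4.8%, sample reduction: 99.1%
```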

Could it be that these results are due to the availability of a handful of super-performing predictors? While it is true that this data is quite predictable and that building a good model is quite easy, there is nothing resembling a perfect predictor that would work in a dataset of any size. Running a single CART tree on this data yields a test partition ROC of only .79074, which is similar to the performance of the first tree in the TreeNet ensemble (as we track the performance of the ensemble as it evolves). Conventional statistical modeling would also require dealing with the many missing values: the missing value rate is about 25% for 18 of the 45 available predictors, and even greater for another 12. The TreeNet graphs also suggest that the main drivers are clearly not linear in the way they drive the target.
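The single-tree baseline can be approximated with scikit-learn's CART-style DecisionTreeClassifier; the synthetic data here is an illustrative stand-in, so the resulting ROC will not reproduce the article's .79074.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic stand-in: 45 predictors, ~13.25% class-1 rate.
X, y = make_classification(n_samples=5_000, n_features=45,
                           n_informative=10, weights=[0.8675],
                           random_state=0)
X_learn, X_test, y_learn, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# A single CART-style tree, depth-limited to avoid memorizing the learn set.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_learn, y_learn)

auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
print(f"single-tree test ROC: {auc:.3f}")
```

The gap between a lone tree and the full ensemble is the usual signature of boosting: many weak trees, each modest on its own, combine into a far stronger scorer.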

One of the leading credit risk modeling companies has used a guideline of 500 GOODs and 500 BADs as a minimum for reliable model building (at least as stated by their speakers at professional conferences such as the Edinburgh Conference on Credit Risk and Credit Control; papers from those conferences are archived at http://www.business-school.ed.ac.uk/crc/conferences/conference-archive). Should we conclude that they are indeed correct, or even that we can make excellent progress with smaller samples still? All we are prepared to say is that the ideal sample size is going to vary, sometimes dramatically, from problem to problem.

One thing to keep in mind is that we are not always focused exclusively on predictive accuracy when we build predictive models, and the reliability of the insights we want to extract from data may suffer with smaller samples. In the example above it is instructive to notice that the ranking of the predictors is fairly different across the two sample sizes. The first display comes from our first model, using the largest training sample, and the second comes from the smallest. Taking the larger sample results as the "truth", we observe that the small sample model substantially down-ranks the most important variable.

[Variable importance ranking: Learn Partition N=41,712]

[Variable importance ranking: Learn Partition N=369]
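The ranking shift can be quantified by fitting the same boosted model at two sample sizes and asking where the large-sample top variable lands in the small-sample ordering; the data, model, and sizes below are illustrative, not the article's TreeNet runs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=4_000, n_features=20,
                           n_informative=8, random_state=0)

def importance_ranking(n):
    """Fit on the first n rows; return feature indices, most important first."""
    model = GradientBoostingClassifier(random_state=0).fit(X[:n], y[:n])
    return np.argsort(model.feature_importances_)[::-1]

large, small = importance_ranking(4_000), importance_ranking(400)

# Where does the large-sample top variable rank in the small-sample model?
top_var = large[0]
rank_in_small = int(np.where(small == top_var)[0][0]) + 1
print(f"variable {top_var} ranks #{rank_in_small} in the small-sample model")
```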

Not surprisingly, the small sample also shows signs of over-fitting, as evidenced by the sharp divergence of the learn (blue curve) and test (red curve) sample results (again, the larger sample model results are shown first):

[TreeNet learn/test performance plots: Learn Partition N=41,712]

[TreeNet learn/test performance plots: Learn Partition N=369]

An observation: by using a very large test sample we were able to make a very good choice of optimal model size (number of trees in the ensemble) for the smallest training sample. In the real world, if we were confined to such a small sample we might have great difficulty deciding how many trees to actually use. If we had erroneously chosen an over-fit model built with 700 trees, our performance measures on the unseen holdout sample would have been worse.

So what if we had resorted to cross-validation, the natural thing to do with a learn sample of N=369? Interestingly enough, the CV run turns out to be a reasonably good guide to the optimal number of trees in the model, although it slightly over-estimates the model's predictive performance on unseen data. The over-fitting is hinted at in the display below, but again what is more impressive is the similarity of the results to those obtained with far larger learn sample sizes.

[TreeNet graphs: Cross-Validation on the N=369 Learn Sample]
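Using cross-validation to pick the ensemble size on a tiny learn sample can be sketched as follows; GradientBoostingClassifier stands in for TreeNet, and the candidate grid and data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# A tiny, roughly balanced learn sample echoing the article's N=369.
X, y = make_classification(n_samples=369, n_features=45,
                           n_informative=10, random_state=0)

# Score candidate ensemble sizes by 5-fold cross-validated ROC area.
scores = {}
for n_trees in (50, 100, 200, 400):
    model = GradientBoostingClassifier(n_estimators=n_trees, random_state=0)
    scores[n_trees] = cross_val_score(model, X, y, cv=5,
                                      scoring="roc_auc").mean()

best = max(scores, key=scores.get)
print(f"cross-validated choice: {best} trees")
```

As the article notes, CV scores computed this way tend to run slightly optimistic relative to a truly untouched holdout, but they are usually a serviceable guide to model size.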

Our conclusion is certainly not a recommendation to seek out small learning samples; as experienced data miners we have always opted for as much data as we could get. Rather, our conclusion is that we should not dismiss smaller samples or worry unduly when sample sizes are not huge. A further conclusion we do intend to draw is that it is not axiomatic that additional data will meaningfully improve the predictive performance of a model.


Tags: CART, Blog, TreeNet