Dan Steinberg's Blog

Using Dates In Data Mining Models


Using dates in any kind of predictive model can be tricky to get right. It is important to be clear about what you are trying to accomplish. Suppose, for example, we are trying to predict sales of a specific brand of beer in a given store and have daily sales data going back several years. One of the patterns we are going to want to track and capture is “seasonality,” which refers to changes in sales levels due to the season of the year. We might find that beer sales of all types are typically highest in the summer months, lowest in the winter, and intermediate in spring and fall. Of course, seasonality is only one factor among many, and good forecasts will require much more information than the date.

To capture seasonality, statisticians and econometricians have long resorted to introducing variables to reflect the season of the year. This could be captured by a categorical variable coded, say, “fall,” “winter,” “spring,” “summer.” A modeler might instead prefer to introduce a variable for the month of the year, or even the week or the day of the year. The point is that this variable would be extracted from the date, and we would leverage the fact that we can observe the seasonal pattern more than once to draw conclusions about something like a “summer effect.”
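As a rough illustration of extracting such seasonal variables from a date, here is a minimal Python sketch. The function name `seasonal_features` and the month-to-season mapping (Northern Hemisphere, meteorological seasons) are assumptions made for this example, not something prescribed by any particular tool.

```python
from datetime import date

# Assumed mapping: Northern Hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

def seasonal_features(d: date) -> dict:
    """Derive repeating (cyclical) predictors from a single date."""
    return {
        "season": SEASON_BY_MONTH[d.month],      # categorical, 4 levels
        "month": d.month,                        # 1..12
        "week_of_year": d.isocalendar()[1],      # ISO week number, 1..53
        "day_of_year": d.timetuple().tm_yday,    # 1..366
    }

features = seasonal_features(date(2023, 7, 4))
print(features["season"], features["month"], features["day_of_year"])
```

Each of these values repeats every year, which is what lets a model pool several summers together when estimating a "summer effect."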

This logic can be carried further to create variables such as "week before X," where "X" could be Easter, Christmas, another major holiday, a three-day weekend, etc. Again, for such patterns to be learned from the data successfully, they should have been observed several times. The same strategy has been used to detect patterns tied to the day of the week (say, a "Friday effect"). In this regard, no real difference exists between data preparation for a conventional statistical model and for data mining. With enough data, patterns can be isolated for multiple frequencies (daily, weekly, monthly).
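A "week before X" flag and a day-of-week variable might be derived along these lines. This is only a sketch: the helper names `in_week_before` and `day_name` are hypothetical, and "week before" is assumed here to mean the seven days immediately preceding the holiday, excluding the holiday itself.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def in_week_before(d: date, holiday: date) -> bool:
    """True if d falls in the 7 days immediately before the holiday."""
    days_until = (holiday - d).days
    return 1 <= days_until <= 7

def day_name(d: date) -> str:
    """Categorical day-of-week predictor (date.weekday() has Monday == 0)."""
    return DAY_NAMES[d.weekday()]

christmas = date(2023, 12, 25)
print(in_week_before(date(2023, 12, 20), christmas))  # inside the window
print(day_name(date(2023, 12, 22)))
```

In practice the flag would be computed against the holiday date for each year in the data, so that the model sees the pattern several times.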

Another important time-related effect on which statisticians often focus is the "trend." A trend is a long-term steady change in the base level of a variable over time that is often used to make projections into the future. For example, social scientists have noted that in many developed countries the percentage of adult smokers has been declining steadily over recent decades. Models designed to predict future cigarette sales will probably do better if they accommodate such trends because they often persist. Trends are normally estimated as the simplest possible average steady rate of change suggested by the data. To capture the monthly trend in a regression model we would add a numeric variable to represent the month to which a record of data pertains, starting with one for the first month, two for the second month, and continuing into a second year of data with 13 for the first month of the second year, etc. Trends differ from seasonal effects in that the value of the trend predictor is always increasing and never wraps back around to the beginning.
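The difference between a trend predictor and a seasonal predictor can be seen in a small sketch. The function name `month_index` and the choice of start date are assumptions for illustration; the numbering follows the scheme described above (1 for the first month, 13 for the first month of the second year).

```python
from datetime import date

def month_index(d: date, start: date) -> int:
    """Running month counter: 1 for the month of `start`, never wrapping."""
    return (d.year - start.year) * 12 + (d.month - start.month) + 1

start = date(2020, 1, 1)  # assumed first month of the training data
for d in (date(2020, 1, 15), date(2020, 12, 15), date(2021, 1, 15)):
    # d.month wraps back to 1 each January; month_index keeps increasing.
    print(d, "seasonal month =", d.month, "trend index =", month_index(d, start))
```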

A statistical model including a time trend would typically allow for a single regression coefficient to capture the steady long-term change observed in the process being studied. If the model discovers a decreasing trend, for example, of 1% per year, we can adjust all predictions made for one year in the future downwards by 1% to reflect this trend. (A regression model does this automatically.)
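To make the single-coefficient idea concrete, the sketch below fits a straight-line trend by ordinary least squares and extends it one period past the data. The numbers are made up for illustration, and `fit_linear_trend` is a hypothetical helper written out by hand rather than a call into any statistics package.

```python
def fit_linear_trend(t, y):
    """Ordinary least squares for y = intercept + slope * t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope

# Made-up yearly values falling by one unit per year.
years = [1, 2, 3, 4]
level = [100.0, 99.0, 98.0, 97.0]

intercept, slope = fit_linear_trend(years, level)

# The fitted line carries the decline into year 5, which lies
# outside the range of the training data.
forecast_year_5 = intercept + slope * 5
print(slope, forecast_year_5)
```

The slope is the one regression coefficient that summarizes the trend, and because the model is linear in time, the same coefficient applies to any future period.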

The problem with data mining models such as the CART decision tree and TreeNet is that they do not contain the linear mathematical structure needed to extend the trend pattern into future predictions. Instead, such tree-based models "flat line" at the edges of the trend (the start and end points of the training data) and turn the trend off outside the range of the training data. This means that we need to deal with the trend separately if we want to use CART or TreeNet. One effective way to do this is to estimate the overall trend separately in a simple regression of the target variable on a time or date variable, where the date variable is continuous. Then, we "detrend" the data by subtracting the predicted trend effect from the target, and use this detrended data for analysis. Once the CART or TreeNet model is completed, we make predictions by first predicting the detrended value of the target and then "adding back in" the trend value. The two parts together make up the final prediction.
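The detrend, model, and add-back steps can be sketched as follows. This is an illustration under stated assumptions, not the CART or TreeNet workflow itself: the data are synthetic, and a simple per-quarter mean lookup (`fit_group_means`, a hypothetical helper) stands in for the tree model because, like a tree, it can only produce piecewise-constant predictions.

```python
def fit_linear_trend(t, y):
    """Ordinary least squares for y = intercept + slope * t."""
    n = len(t)
    t_mean, y_mean = sum(t) / n, sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope

def fit_group_means(groups, values):
    """Piecewise-constant stand-in for a tree: mean of values per group."""
    sums, counts = {}, {}
    for g, v in zip(groups, values):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

# Synthetic quarterly data: base 100, +2 per period, plus a quarter effect.
QUARTER_EFFECT = [0.0, 5.0, 10.0, 5.0]
t_train = list(range(16))
q_train = [t % 4 for t in t_train]
y_train = [100.0 + 2.0 * t + QUARTER_EFFECT[t % 4] for t in t_train]

# Step 1: estimate the overall trend with a simple regression on time.
intercept, slope = fit_linear_trend(t_train, y_train)

# Step 2: detrend by subtracting the fitted trend from the target.
detrended = [y - (intercept + slope * t) for t, y in zip(t_train, y_train)]

# Step 3: fit the piecewise-constant model to the detrended target.
season_model = fit_group_means(q_train, detrended)

def predict(t):
    """Step 4: predict the detrended value, then add the trend back in."""
    return season_model[t % 4] + (intercept + slope * t)

# Baseline: the same model on the raw target, with no trend handling.
flat_model = fit_group_means(q_train, y_train)

t_future = 16  # first period beyond the training range
truth = 100.0 + 2.0 * t_future + QUARTER_EFFECT[t_future % 4]
print("truth:", truth)
print("detrend + add back:", predict(t_future))
print("no trend handling:", flat_model[t_future % 4])
```

On this synthetic series the baseline stays near the historical level for that quarter, while the two-part prediction continues to follow the trend past the end of the training data.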


Tags: Blog, Data Mining, Dates