Dan Steinberg's Blog

Automatic Detection Of Variable Types In SPM

SPM detects variable types automatically when it reads a dataset, but the user may still need to step in to get the intended treatment. The detection rules depend on the type of file being read:

CSV Or Other Plain Text

SPM uses two sources of information to determine variable types in plain text files. First, if the name of a variable ends with a "$" symbol, the variable is treated as character, and therefore categorical, regardless of the actual values found for the variable. Thus, even if the variable contains only numbers, they are treated as text when the variable name so dictates.

Second, if the variable name does not end with a "$", the variable may still be treated as character (and thus categorical) if a scan of the data reveals any value that is not strictly numeric. This requires careful attention because the data scan can lead to unintended treatment as text, for example when missing values are represented by entries such as "MISSING" or "NA". To avoid this, represent missing values with blanks or with a single dot, as in ".". (The quote marks must not be entered.)
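A file that already uses text placeholders for missing values can be cleaned before SPM reads it. The following is a minimal Python sketch of one way to do that, not part of SPM itself. The function name blank_out_missing, the placeholder set MISSING_TOKENS, and the sample data are all hypothetical and should be adjusted to the file at hand. Columns whose names end with "$" are left alone, since they are read as character in any case.

```python
import csv
import io

# Assumed placeholder strings; change these to match what the file actually uses.
MISSING_TOKENS = {"MISSING", "NA"}


def blank_out_missing(src, dst, tokens=MISSING_TOKENS):
    """Copy CSV rows from src to dst, blanking placeholder tokens.

    Only columns whose names do not end with "$" are touched, because
    "$" columns are treated as character regardless of their values.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to copy
    writer.writerow(header)
    # True for columns that are intended to be numeric.
    numeric_intent = [not name.strip().endswith("$") for name in header]
    for row in reader:
        cleaned_row = []
        for i, cell in enumerate(row):
            touch = i < len(numeric_intent) and numeric_intent[i]
            if touch and cell.strip() in tokens:
                cleaned_row.append("")  # a blank is read as missing
            else:
                cleaned_row.append(cell)
        writer.writerow(cleaned_row)


# Small in-memory demonstration with made-up data.
raw = "AGE,INCOME,REGION$\n34,52000,East\nNA,MISSING,West\n41,NA,NA\n"
out = io.StringIO()
blank_out_missing(io.StringIO(raw), out)
cleaned = out.getvalue()
print(cleaned)
```

In the demonstration the placeholders in AGE and INCOME become blanks, while the "NA" in REGION$ is kept as an ordinary text value.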

If the variable is purely numeric but the analyst wishes to treat it as categorical, this must be conveyed to the modeling engines either with the command:

CATEGORY variable_name

or in the GUI by checking the categorical column in the Model Setup dialog.
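The plain text rules described in this section can be summarized in a short Python sketch. This is only an illustration of the logic as described here, not SPM's actual implementation. The names infer_type, is_numeric_or_missing and forced_categorical are hypothetical, and the numeric test relies on Python's float(), which may accept or reject some strings differently than SPM does.

```python
def is_numeric_or_missing(value):
    """Return True if value is blank, a lone dot, or parses as a number.

    Assumption: float() stands in for SPM's own numeric test. Note that
    float() also accepts strings such as "nan" and "inf".
    """
    text = value.strip()
    if text in ("", "."):
        return True  # blank or "." represents a missing value
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_type(name, values, forced_categorical=()):
    """Classify one column using the rules for plain text files."""
    # Rule 1: a trailing "$" in the name forces character treatment.
    if name.endswith("$"):
        return "categorical (character)"
    # Rule 2: any non-numeric value found in the scan forces character treatment.
    if not all(is_numeric_or_missing(v) for v in values):
        return "categorical (character)"
    # Rule 3: a purely numeric column is categorical only if the analyst says so.
    if name in forced_categorical:
        return "categorical (numeric)"
    return "continuous"


print(infer_type("ZIP$", ["10001", "94105"]))
print(infer_type("INCOME", ["52000", "NA"]))
print(infer_type("INCOME", ["52000", ".", ""]))
print(infer_type("GRADE", ["1", "2", "3"], forced_categorical={"GRADE"}))
```

The second call shows the pitfall discussed above: a single "NA" entry is enough to turn an otherwise numeric column into a character variable.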


Binary File Types (Statistical Packages)

These file types carry information about each variable in the file header, so SPM can proceed without first scanning the data, which is much faster than scanning a plain text file. For these file types the only distinction SPM makes immediately is between numeric and non-numeric variables. Non-numeric variables are always treated as categorical. Numeric variables are treated as categorical only in response to direct action, either via the CATEGORY command or by checking the categorical box in the Model Setup dialog.

SPM does not recognize categorical declarations for numeric variables that are embedded in the metadata of SAS data files.


Excel Files

Excel files are a hybrid file type. On the one hand, they contain metadata that could guide how the columns of the spreadsheet are handled. On the other hand, Excel gives you so much freedom and control at the cell level that SPM typically scans the data in order to make final decisions. The topic of Excel files is complicated enough that we treat it separately in another article.


Tags: Blog, SPM, Automation