Paper 157-2011
SAS Global Forum 2011 Data Mining and Text Analytics
Data mining in SAS® with open source software
Zhengping Ma, Eli Lilly and Company
ABSTRACT
It is common in many industries for data to exist in SAS format, and statisticians are often more familiar with the SAS programming environment than with other systems. SAS Enterprise Miner is therefore usually the first choice for data mining projects under such circumstances. WEKA, on the other hand, is an open source data mining suite that is available free of charge. More importantly, its source code can be modified for algorithm customization. It also re-implements many classic data mining algorithms, including C4.5, which is called J48 in WEKA. An additional advantage of WEKA is that it can run hundreds of variations of a model from the command line or from custom code, which is often necessary in research and is key to developing reusable tools.
This paper presents ways to connect SAS to WEKA and R and to run packages defined outside SAS for data mining projects. It enables us to develop tools for tasks such as subgroup identification utilizing features available in WEKA, rattle, etc. In addition, it helps to reduce cost by using free of charge open source packages for exploratory analysis.
Most statisticians are highly familiar with the SAS programming environment. With its state-of-the-art packages, SAS Enterprise Miner (SAS EM) is usually the tool of choice for data mining projects, especially when SAS is the preferred software for data analysis. SAS EM comes with an easy-to-use graphical user interface. When necessary, it can also be customized with extension nodes.
Alternatively, WEKA, as an open source data mining tool with perhaps the largest number of algorithms of any data mining tool, is available free of charge. Since it is open source software, it is easy to modify its source code and recompile for algorithm customization in WEKA. It is also very straightforward to run many variations of similar models using the same algorithm with a simple loop.
Rattle is another data mining application, built upon R. While an understanding of R is not required in order to use rattle, it does empower a user to conduct more sophisticated data mining projects by taking advantage of all features available in R. Rattle is simple to use, quick to deploy, and allows a user to work rapidly through the steps of a data mining project, from data preparation and modeling to evaluation. R, in turn, provides a very powerful platform for performing data mining well beyond the limitations that are embodied in any graphical user interface. A significant benefit of using rattle is that we can easily migrate from rattle to R to fine-tune and further develop our data mining projects when necessary. Migration into R is very straightforward since rattle exposes all of the underlying R code in its log window, which can be either directly deployed within R or saved in R scripts for future reference. These tool-generated R scripts can also be modified and reused for similar data mining projects in the future.
For a variety of reasons, it is sometimes desirable to use SAS as the working environment while still taking advantage of tools such as WEKA, rattle, and R in general. This paper presents ways to connect SAS with WEKA and R, use algorithms defined in WEKA/R packages, and return the results to SAS for further processing. The recently released SAS/IML Studio 3.2 comes with a number of useful new features, such as the ability to call SAS and R. We will use it as a tool for general SAS programming and for connecting to external tools/systems.
CONFIGURE SAS/IML STUDIO FOR WEKA CLASSES
Unfortunately, as of today, there is no direct connection between SAS and WEKA available. However, since SAS/IML Studio and WEKA are both built upon Java technology, we can use WEKA classes from SAS/IML Studio as user-defined classes, with Java as the bridge.
Below are the steps to configure SAS/IML Studio so that it can find and load the WEKA class definitions and use them within SAS/IML Studio. Please refer to appendix 4 for more details with visual illustration.
1. Download WEKA from http://www.cs.waikato.ac.nz/ml/weka/ and save the file weka.jar to a location of your choice.
2. Add the path to the file weka.jar in SAS/IML Studio by clicking on Tools, then Options.
3. Click on the ‘Directories’ tab and, from the ‘Show directories for’ drop-down list, select ‘Classes’.
USING WEKA CLASSES IN SAS/IML STUDIO
With the set-up illustrated above, you are ready to take advantage of the features in WEKA. However, using WEKA classes in SAS/IML Studio directly requires some knowledge of Java programming. Typically, you will need to import the relevant packages before you can use the methods defined for those classes or create instances for model building.
The first step in a data mining project is to load/prepare data. Below is sample code for this step:
/* import WEKA classes, J48 is used as an example, import as many as needed */
import weka.classifiers.trees.J48;
It is also very important for data miners to know their data well before any model is applied to it. WEKA’s graphical user interface comes with visualization tools that show the distribution of all variables. Even though you can also use these features programmatically, it is much easier to use the GUI so that you can explore those visualizations interactively.
Assuming that you decide to use the classical tree algorithm, which is implemented in the WEKA class J48, below is the code to build the model:
/* create model builder, an instance of J48 */
declare J48 myJ48 = new J48();
/* build model based on the training data (mydata is the data loaded in the previous step) */
myJ48.buildClassifier(mydata);
After a model has been built on the training data, you can perform cross-validation, make predictions for new data based on the model, etc.
/* load test data */
declare Instances test = ConverterUtils$DataSource.read("c:/weka/test.csv");
/* evaluate model on test data */
declare EvaluationUtils evu = new EvaluationUtils();
declare weka.core.FastVector predics = evu.getTestPredictions((Classifier)myJ48, test);
Those predicted probabilities can be used to create an ROC curve. The details of programming in SAS/IML Studio are beyond the scope of this paper. However, you can check the attached sample code for an example, which also demonstrates how to use SAS data steps to create simulated data for model evaluation/prediction, and how to use packages in R to draw ROC curves based on model predictions. You can also refer to the SAS/IML Studio help documents or the WEKA API documentation for more details.
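As a rough illustration of how predicted probabilities turn into an ROC curve, the sketch below sorts cases by predicted probability, sweeps the threshold from high to low, and integrates the resulting points with the trapezoidal rule. It is written in plain Java with no WEKA dependency; the class name RocSketch and its methods are hypothetical and are not part of WEKA, SAS/IML Studio, or the attached sample code, which uses R packages for this step. It assumes a two-class problem in which both classes occur in the test data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a WEKA class): ROC points and AUC from predicted probabilities. */
final class RocSketch {

    /**
     * Returns ROC points as rows {falsePositiveRate, truePositiveRate}, from (0,0) to (1,1).
     * scores[i] is the predicted probability of the positive class for case i;
     * positive[i] is the true label. Cases with tied scores are added as one group.
     */
    static double[][] rocPoints(double[] scores, boolean[] positive) {
        int n = scores.length;
        if (positive.length != n) {
            throw new IllegalArgumentException("scores and labels differ in length");
        }
        int totalPos = 0;
        for (boolean p : positive) {
            if (p) totalPos++;
        }
        int totalNeg = n - totalPos;
        if (totalPos == 0 || totalNeg == 0) {
            throw new IllegalArgumentException("both classes must be present");
        }
        // Order case indices by descending score.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0.0, 0.0});
        int tp = 0;
        int fp = 0;
        int i = 0;
        while (i < n) {
            double threshold = scores[order[i]];
            // Lower the threshold past every case that shares this score.
            while (i < n && scores[order[i]] == threshold) {
                if (positive[order[i]]) tp++; else fp++;
                i++;
            }
            points.add(new double[] {(double) fp / totalNeg, (double) tp / totalPos});
        }
        return points.toArray(new double[0][]);
    }

    /** Area under the ROC curve by the trapezoidal rule. */
    static double auc(double[][] points) {
        double area = 0.0;
        for (int k = 1; k < points.length; k++) {
            double width = points[k][0] - points[k - 1][0];
            area += width * (points[k][1] + points[k - 1][1]) / 2.0;
        }
        return area;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.7, 0.6};
        boolean[] labels = {true, false, true, false};
        System.out.println("AUC = " + auc(rocPoints(scores, labels))); // prints AUC = 0.75
    }
}
```

In the example, three of the four positive/negative pairs are ranked correctly, which matches the area of 0.75. In practice the scores would be taken from the prediction objects returned by WEKA or from the probability matrix returned by R.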
USING RATTLE/R IN SAS/IML STUDIO
One of the new features in SAS/IML Studio 3.2 is the capability to connect to R with a submit block. Since rattle is an R-based tool, we can use this feature directly to run data mining projects with rattle from SAS, as the following code segment illustrates:
submit / R;
There are two different ways to send data to R. The first is to use the ExportDataSetToR module, which uses a SAS dataset as source data, as illustrated below:
run ExportDataSetToR("work.mySASdata", "myRdata" );
The other is to use the ExportToR method defined in the DataObject class, which uses a DataObject as the source, as illustrated below:
declare DataObject dobj;
dobj = DataObject.CreateFromFile( "myDataFile" );
dobj.ExportToR( "myRdataFrame" );
Similarly, there are two different ways to get data from R. We can either use the ImportDataSetFromR module, as illustrated below:
run ImportDataSetFromR( "WORK.mySASdataset", "myRdataFrame" );
Or use CreateFromR method defined in DataObject, as illustrated below:
declare DataObject dobj;
dobj = DataObject.CreateFromR( "SAS_IML_DataObject", "R_DataFrame" );
As illustrated above, submitting code to R for processing is very straightforward. For example, with the rattle package installed in R, we can send the following code to R to prepare the data:
submit / R;
# Load the data.
crs$dataset <- read.arff("file:///C:/weka/weka-3-6-2/data/iris.arff")
# Partition the source data into the training/validate/test datasets
set.seed(42)
crs$sample <- crs$train <- sample(nrow(crs$dataset), 105)
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 22)
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate)
endsubmit;
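The partitioning logic in the R code above (105 training rows, 22 validation rows, and the remaining rows for testing, with no overlap) can be sketched in plain Java as follows. This is a minimal sketch under stated assumptions rather than a translation of the rattle code: it uses a seeded shuffle followed by slicing instead of repeated sampling with setdiff, the class name PartitionSketch is hypothetical, the indices are 0-based rather than 1-based as in R, and a Java seed of 42 will not reproduce the rows that R selects with set.seed(42).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: disjoint train/validate/test index sets from a seeded shuffle. */
final class PartitionSketch {

    /**
     * Shuffles the row indices 0..n-1 with a fixed seed and slices them into three lists:
     * element 0 holds nTrain training indices, element 1 holds nValidate validation indices,
     * and element 2 holds all remaining indices for testing.
     */
    static List<List<Integer>> split(int n, int nTrain, int nValidate, long seed) {
        if (nTrain < 0 || nValidate < 0 || nTrain + nValidate > n) {
            throw new IllegalArgumentException("partition sizes do not fit the data");
        }
        List<Integer> indices = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // A fixed seed makes the partition repeatable from run to run.
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, nTrain)));
        parts.add(new ArrayList<>(indices.subList(nTrain, nTrain + nValidate)));
        parts.add(new ArrayList<>(indices.subList(nTrain + nValidate, n)));
        return parts;
    }

    public static void main(String[] args) {
        // The iris data has 150 rows; sizes mirror the R code above.
        List<List<Integer>> parts = split(150, 105, 22, 42L);
        System.out.println("train=" + parts.get(0).size()
                + " validate=" + parts.get(1).size()
                + " test=" + parts.get(2).size());
    }
}
```

Because every index appears exactly once in the shuffled list, the three slices are disjoint by construction, which is the property the nested setdiff calls enforce in the R version.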
We can then add the following code to the same submit block to build a decision tree model based on the data:
crs$rpart <- rpart(class ~ ., data=crs$dataset[crs$train, ], method="class",
                   parms=list(split="information"),
                   control=rpart.control(minsplit=5, minbucket=2, usesurrogate=0, maxsurrogate=0))
And make predictions based on the model with the following code:
p <- predict(crs$rpart, newdata=test, type="prob")
Here test is assumed to hold the rows set aside for testing, for example crs$dataset[crs$test, ]. Again, the details of R programming are beyond the scope of this paper. Please refer to appendix 2 for sample code, which is a modified copy of code generated by rattle through its graphical user interface. It contains sample code for loading data, building a decision tree model, making model predictions, etc.
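The split="information" setting in the rpart call above asks for splits to be chosen with an entropy-based impurity measure rather than the Gini index. The plain Java sketch below shows the underlying arithmetic for one candidate binary split: the entropy of the parent node minus the size-weighted entropy of the two child nodes. The class name InfoGainSketch and the class counts are illustrative assumptions; this is not rpart's or J48's internal code (J48, for instance, works with the gain ratio, a normalized variant of this quantity).

```java
/** Hypothetical helper: entropy and information gain for one candidate binary split. */
final class InfoGainSketch {

    /** Shannon entropy, in bits, of a node described by its per-class case counts. */
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) { // empty classes contribute nothing
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /** Parent entropy minus the size-weighted entropy of the left and right children. */
    static double informationGain(int[] left, int[] right) {
        if (left.length != right.length) {
            throw new IllegalArgumentException("children must use the same class list");
        }
        int[] parent = new int[left.length];
        int nLeft = 0;
        int nRight = 0;
        for (int j = 0; j < left.length; j++) {
            parent[j] = left[j] + right[j];
            nLeft += left[j];
            nRight += right[j];
        }
        int n = nLeft + nRight;
        if (n == 0) {
            return 0.0;
        }
        return entropy(parent)
                - ((double) nLeft / n) * entropy(left)
                - ((double) nRight / n) * entropy(right);
    }

    public static void main(String[] args) {
        // Illustrative counts only: a split that isolates one of three equally sized classes.
        int[] left = {35, 0, 0};
        int[] right = {0, 35, 35};
        System.out.println("information gain (bits) = " + informationGain(left, right));
    }
}
```

With these counts the parent entropy is log2(3), the left child is pure, and the right child has an entropy of one bit, so the gain is log2(3) - 2/3, or about 0.918 bits. A tree builder evaluates this quantity for every candidate split and keeps the best one.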
USING WEKA THROUGH RWEKA FROM SAS/IML STUDIO
It should also be mentioned that there is an R package called RWeka, which serves as a bridge between WEKA and R and enables a user to use WEKA from R. Therefore, WEKA can also be used from SAS indirectly through RWeka, in the same way that we use rattle, as described above. Note that you need to have both the rJava and RWeka packages installed in your R installation before you can use them from SAS/IML Studio. Sample code with use cases for different models is also attached, containing examples that range from loading data and building a model to making model predictions and creating ROC curves. It also provides examples of building models based on other algorithms. Overall, if R is a programming language you are comfortable with, this is an easier way for statisticians to use WEKA, since it hides some lower-level technical details and does not require detailed knowledge of WEKA classes. Please refer to appendix 3 for more details. Given the indirect approach used here, from SAS/IML Studio to R and then to WEKA through RWeka, this example can also serve as a reference for how to use WEKA from R if your preferred working environment is R instead of SAS.
I would like to point out that the techniques introduced in this paper can be applied in general to connect SAS to any Java-based or R-based external tool, including user-defined Java classes or R functions, for projects beyond the data mining field. They can be very useful for taking advantage of past work or for sharing work completed in different systems with your colleagues. It is also worth mentioning that SAS/IML Studio comes with powerful tools for graphics programming to create dynamically linked plots. By pulling data into SAS/IML Studio from external environments, we are able to create customized visualizations of our results within SAS/IML Studio.