
SAS Global Forum 2011 Data Mining and Text Analytics

Paper 157-2011

Data mining in SAS® with open source software

Zhengping Ma, Eli Lilly and Company


ABSTRACT

It is common in many industries for data to exist in SAS format. Statisticians are often more familiar with the SAS programming environment than with other systems. Therefore, SAS Enterprise Miner is usually the first choice for data mining projects under such circumstances. On the other hand, WEKA is an open source data mining suite that is available free of charge. More importantly, its source code can be modified for algorithm customization. It also re-implements many classic data mining algorithms, including C4.5, which is called J48 in WEKA. An additional advantage of using WEKA is that it can run hundreds of variations of a model from the command line or with customized code, which is often necessary in research and key for developing reusable tools.

This paper presents ways to connect SAS to WEKA and R and to run packages defined outside SAS for data mining projects. It enables us to develop tools for tasks such as subgroup identification utilizing features available in WEKA, rattle, etc. In addition, it helps to reduce cost by using free of charge open source packages for exploratory analysis.


Today, there are many options available to choose from as tools for data mining projects. In many industries such as pharmaceuticals, SAS is commonly used for data analysis. As a result of this, data is often available in SAS format.

Most statisticians are highly familiar with the SAS programming environment. With its state-of-the-art packages, SAS Enterprise Miner is usually the tool of choice for data mining projects, especially when SAS is the preferred software for data analysis. SAS EM comes with an easy-to-use graphical user interface. When necessary, it is also possible to do some customization with extension nodes.

Alternatively, WEKA, as an open source data mining tool with perhaps the largest number of algorithms of any data mining tool, is available free of charge. Since it is open source software, it is easy to modify its source code and recompile for algorithm customization in WEKA. It is also very straightforward to run many variations of similar models using the same algorithm with a simple loop.
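As a rough illustration of the "simple loop" idea, the sketch below builds one WEKA command line per parameter setting for J48. It is a minimal sketch under stated assumptions, not the author's code: the class name J48GridSketch and the file names weka.jar and train.arff are hypothetical, and the options used (-t for the training file, -C for the pruning confidence factor, -M for the minimum number of instances per leaf) are the ones documented for J48's command-line interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: builds WEKA J48 command lines for a grid of settings.
final class J48GridSketch {

    /** Returns one command line per (confidence factor, minimum leaf size) pair. */
    static List<List<String>> buildCommands(String wekaJar, String trainFile,
                                            double[] confidences, int[] minLeafSizes) {
        List<List<String>> commands = new ArrayList<>();
        for (double c : confidences) {
            for (int m : minLeafSizes) {
                commands.add(new ArrayList<>(Arrays.asList(
                        "java", "-cp", wekaJar,
                        "weka.classifiers.trees.J48",
                        "-t", trainFile,                               // training data (assumed path)
                        "-C", String.format(Locale.ROOT, "%.2f", c),   // pruning confidence factor
                        "-M", Integer.toString(m))));                  // minimum instances per leaf
            }
        }
        return commands;
    }

    public static void main(String[] args) {
        List<List<String>> commands = buildCommands(
                "weka.jar", "train.arff",
                new double[] {0.10, 0.25, 0.50}, new int[] {2, 5, 10});
        // Print the commands; each list could instead be handed to ProcessBuilder.
        for (List<String> cmd : commands) {
            System.out.println(String.join(" ", cmd));
        }
    }
}
```

Each generated list can be passed to java.lang.ProcessBuilder to actually run the model, so the number of variations grows with the parameter arrays rather than with the amount of code.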

Rattle is another data mining application built upon R. While an understanding of R is not required in order to use Rattle, it does empower a user to conduct more sophisticated data mining projects by taking advantage of all features available in R. Rattle is simple to use, quick to deploy, and allows a user to rapidly work through the steps of a data mining project, from data preparation and modeling to evaluation. On the other hand, R provides a very powerful platform for performing data mining well beyond the limitations that are embodied in any graphical user interface. A significant benefit of using rattle is that we can easily migrate from rattle to R to fine tune and further develop our data mining projects when necessary. Migration into R is very straightforward since rattle exposes all of the underlying R code in its log window, which can be either directly deployed within R, or saved in R scripts for future reference. These tool-generated R scripts can also be modified and reused for similar data mining projects in the future.

For a variety of reasons, it is sometimes desirable to use SAS as the working environment while still taking advantage of tools such as WEKA, rattle, and R in general. This paper presents ways to connect SAS with WEKA and R, use algorithms defined in WEKA or R packages, and return the results to SAS for further processing. The recently released SAS/IML studio 3.2 comes with a number of useful new features, such as calling SAS and R. We will use it as a tool for general SAS programming and for connecting to external tools and systems.


Unfortunately, as of today, there is no direct connection between SAS and WEKA available. However, since both SAS/IML studio and WEKA are built upon Java technology, we can use WEKA classes from SAS/IML studio as user defined classes with Java as the bridge.

Below are steps to configure SAS/IML studio so that it is able to find and load the class definitions of WEKA classes and use them within SAS/IML studio. Please refer to appendix 4 for more details with visual illustration.

1. Download WEKA from http://www.cs.waikato.ac.nz/ml/weka/ and save the file weka.jar to a location of your choice.

2. Add the path to the file weka.jar in SAS/IML Studio: click Tools, then Options; on the 'Directories' tab, select 'Classes' from the 'Show directories for' drop-down list.



With the appropriate set-up illustrated above, you are ready to take advantage of features in WEKA. However, using WEKA classes in SAS/IML studio directly requires some knowledge of Java programming. Typically, you will need to import the relevant packages before you can use methods defined for classes, or create instances for model building.

The first step in a data mining project is to load/prepare data. Below is sample code for this step:

/* import WEKA classes, J48 is used as an example, import as many as needed */
import weka.classifiers.trees.J48;

It is also very important for a data miner to know his/her data well before any model is applied to the data. WEKA's graphical user interface comes with visualization tools that allow the user to see the distribution of all variables. Even though you can also use these features programmatically, it is much easier to use the GUI so that you can see those visualizations interactively.

Assuming that you decide to use the classical tree algorithm, which is implemented in WEKA class J48, below is the sample code:

/* create model builder, an instance of J48 */
declare J48 myJ48 = new J48();

/* build model based on data */
myJ48.buildClassifier(mydata);

After a model has been built based on training data, you can do cross-validation, make predictions for new data based on the model, etc.

declare Instances test = ConverterUtils$DataSource.read("c:/weka/test.csv");

/* evaluate model on test data */
declare EvaluationUtils evu = new EvaluationUtils();
declare weka.core.FastVector predics = evu.getTestPredictions((Classifier)myJ48, test);

Those predicted probabilities can be used to create an ROC curve. The details of programming in SAS/IML studio are beyond the scope of this paper. However, you can check the attached sample code for an example, which also demonstrates how to use SAS data steps to create simulation data for model evaluation/prediction, and how to use packages in R to draw ROC curves based on model predictions. You can also refer to the SAS/IML help documents or the WEKA API documentation for more details.
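To make the step from predicted probabilities to an ROC curve concrete, here is a minimal, library-free Java sketch. It is not the attached sample code (which uses R packages for plotting); the class name RocSketch and the example data are hypothetical, and the method assumes that both classes occur at least once. Cases are sorted by predicted probability, the threshold is lowered one distinct score at a time, and the area under the curve is computed with the trapezoidal rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: ROC points and AUC from predicted probabilities.
final class RocSketch {

    /** Returns points {falsePositiveRate, truePositiveRate} from (0,0) to (1,1). */
    static List<double[]> rocPoints(double[] scores, boolean[] isPositive) {
        int n = scores.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Sort case indices by predicted probability, highest first.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        int positives = 0;
        for (boolean p : isPositive) if (p) positives++;
        int negatives = n - positives;   // assumes positives > 0 and negatives > 0

        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0.0, 0.0});
        int tp = 0, fp = 0;
        for (int k = 0; k < n; k++) {
            if (isPositive[order[k]]) tp++; else fp++;
            // Emit a point only after the last case in a run of tied scores.
            boolean lastOfTie = (k == n - 1) || scores[order[k + 1]] != scores[order[k]];
            if (lastOfTie) {
                points.add(new double[] {(double) fp / negatives, (double) tp / positives});
            }
        }
        return points;
    }

    /** Area under the ROC curve by the trapezoidal rule. */
    static double auc(List<double[]> points) {
        double area = 0.0;
        for (int i = 1; i < points.size(); i++) {
            double[] p = points.get(i - 1);
            double[] q = points.get(i);
            area += (q[0] - p[0]) * (p[1] + q[1]) / 2.0;
        }
        return area;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.7, 0.6, 0.55, 0.4};          // made-up probabilities
        boolean[] labels = {true, true, false, true, false, false}; // made-up true classes
        List<double[]> points = rocPoints(scores, labels);
        for (double[] pt : points) System.out.println(pt[0] + "\t" + pt[1]);
        System.out.println("AUC = " + auc(points));
    }
}
```

The resulting points can be written to a data set and plotted in whichever environment is most convenient, whether SAS, R, or SAS/IML studio.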



One of the new features in SAS/IML studio 3.2 is the capability to connect to R with a submit block. Since rattle is an R based tool, we can use this feature directly to run data mining projects with rattle from SAS, as the following code segment illustrates:

submit / R;


There are two different ways to send data to R. The first is to call the ExportDataSetToR module, which uses a SAS dataset as source data, as illustrated below:

run ExportDataSetToR("work.mySASdata", "myRdata" );

The other is to use the ExportToR method defined in DataObject, which uses a DataObject as the source, as illustrated below:

declare DataObject dobj;

dobj = DataObject.CreateFromFile( "myDataFile" );

dobj.ExportToR( "myRdataFrame" );

Similarly, there are two different ways to get data from R. We can either use the ImportDataSetFromR module, as illustrated below:

run ImportDataSetFromR( "WORK.mySASdataset", "myRdataFrame" );

Or use CreateFromR method defined in DataObject, as illustrated below:

declare DataObject dobj;

dobj = DataObject.CreateFromR( "SAS_IML_DataObject", "R_DataFrame" );

As illustrated above, submitting code to R for processing is very straightforward. For example, with the rattle package installed in R, we can send the following code to R to prepare the data:

submit / R;
library(foreign)   # assumed source of read.arff; RWeka provides one as well

# Load the data.
crs$dataset <- read.arff("file:///C:/weka/weka-3-6-2/data/iris.arff")

# Partition the source data into the training/validate/test datasets
set.seed(42)
crs$sample <- crs$train <- sample(nrow(crs$dataset), 105)
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 22)
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate)
endsubmit;
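For readers who work on the Java side of the bridge, the same partitioning logic can be sketched without any library. This is only an illustration of the idea (shuffle the row indices once with a fixed seed, then cut them into disjoint training, validation, and test sets); the class name PartitionSketch is hypothetical, and Java's random number generator will not reproduce the rows that R selects with set.seed(42).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: reproducible train/validate/test split of row indices.
final class PartitionSketch {

    /** Returns three disjoint index lists: train, validate, test (the remainder). */
    static List<List<Integer>> partition(int nRows, int nTrain, int nValidate, long seed) {
        if (nTrain < 0 || nValidate < 0 || nTrain + nValidate > nRows) {
            throw new IllegalArgumentException("train + validate must not exceed nRows");
        }
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < nRows; i++) rows.add(i);
        // A fixed seed makes the shuffle, and therefore the split, repeatable.
        Collections.shuffle(rows, new Random(seed));

        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(rows.subList(0, nTrain)));
        parts.add(new ArrayList<>(rows.subList(nTrain, nTrain + nValidate)));
        parts.add(new ArrayList<>(rows.subList(nTrain + nValidate, nRows)));
        return parts;
    }

    public static void main(String[] args) {
        // Same sizes as the R code: 150 rows, 105 for training, 22 for validation.
        List<List<Integer>> parts = partition(150, 105, 22, 42L);
        System.out.println("train=" + parts.get(0).size()
                + " validate=" + parts.get(1).size()
                + " test=" + parts.get(2).size());
    }
}
```

With 150 rows, 105 for training and 22 for validation, the remaining 23 rows form the test set, matching the sizes used in the R code.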

We can then add the following code into the same submit block to build a decision tree model based on the data:

library(rpart)   # provides rpart and rpart.control
crs$rpart <- rpart(class ~ ., data=crs$dataset[crs$train, ], method="class",
                   parms=list(split="information"),
                   control=rpart.control(minsplit=5, minbucket=2, usesurrogate=0, maxsurrogate=0))
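The split="information" setting asks rpart to choose splits with an entropy-based criterion instead of the default Gini index. The Java sketch below shows the textbook form of that criterion for a single candidate split; it is an illustration only, with a hypothetical class name InfoGainSketch, and rpart's internal scaling of the impurity may differ from this formula.

```java
// Hypothetical helper: textbook entropy and information gain for one binary split.
final class InfoGainSketch {

    private static int sum(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    /** Shannon entropy in bits of a node with the given class counts. */
    static double entropy(int[] classCounts) {
        int total = sum(classCounts);
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;                    // 0 * log(0) is taken as 0
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2.0);
        }
        return h;
    }

    /** Parent entropy minus the size-weighted entropy of the two child nodes. */
    static double informationGain(int[] parent, int[] left, int[] right) {
        int nLeft = sum(left);
        int nRight = sum(right);
        int n = nLeft + nRight;
        return entropy(parent)
                - ((double) nLeft / n) * entropy(left)
                - ((double) nRight / n) * entropy(right);
    }

    public static void main(String[] args) {
        // Three balanced classes of 50 cases; the split isolates the first class.
        int[] parent = {50, 50, 50};
        int[] left = {50, 0, 0};
        int[] right = {0, 50, 50};
        System.out.println("gain = " + informationGain(parent, left, right));
    }
}
```

A tree builder evaluates this quantity for every candidate split at a node and keeps the split with the largest gain.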

And make predictions based on the model with the following code:

# test is assumed to be a data frame of new observations, for example crs$dataset[crs$test, ]
p <- predict(crs$rpart, newdata=test, type="prob")

Again, the details of R programming are beyond the scope of this paper. Please refer to appendix 2 for sample code, which is a modified copy of code generated by rattle through its graphical user interface. It contains sample code for loading data, building a decision tree model, making model predictions, etc.


It should also be mentioned that there is an R package called RWeka, which serves as a bridge between WEKA and R and enables a user to use WEKA from R. Therefore, WEKA can also be used from SAS indirectly through RWeka, in the same way that we use rattle, as described above. Note that both the rJava and RWeka packages must be installed in your R installation before you can use them from SAS/IML studio. Sample code with use cases for different models is also attached, covering loading data, building a model, making model predictions, and creating ROC curves. It also provides examples of building models based on other algorithms. Overall, if R is a programming language you are comfortable with, this is an easier way for statisticians to use WEKA, since it hides some lower level technical details and does not require detailed knowledge of WEKA classes. Please refer to appendix 3 for more details. Because the approach is indirect, from SAS/IML studio to R and then to WEKA through RWeka, this example can also serve as a reference for using WEKA from R if your preferred working environment is R instead of SAS.

I would like to point out that the techniques introduced in this paper can be applied in general to connect SAS to any Java-based or R-based external tool, including user defined Java classes or R functions, for projects beyond the data mining field. This can be very useful for taking advantage of past work, or for sharing work completed in different systems with your colleagues. It is also worth mentioning that SAS/IML studio comes with powerful tools for graphics programming to create dynamically linked plots. By pulling data into IML studio from external environments, we are able to create customized visualizations of our results within SAS/IML studio.

