
SAS Global Forum 2008 Data Integration

Paper 133-2008

Data Integration in a Grid-Enabled Environment

Cheryl Doninger, SAS Institute Inc., Cary, NC

Gary Mehler, SAS Institute Inc., Cary, NC

Nancy Rausch, SAS Institute Inc., Cary, NC

ABSTRACT

SAS® Data Integration Studio and SAS® Grid Manager add capabilities to the SAS® product suite to distribute workloads across a grid of computers and thereby allow large processes to complete more quickly than previously possible. SAS Grid Manager has been incorporated into SAS Data Integration Studio to facilitate using grid resources for any long-running task that can be processed in parallel with another task. This paper discusses typical data integration workloads, how to scale them on typical grid computing hardware, and the new capability to load balance multiple data integration tasks across grid resources.

INTRODUCTION

SAS was originally developed as a 4GL programming language that could be used to write SAS applications. These applications executed in a single, sequential path that matched the single CPU capabilities at that time.

Figure 1. SAS Version 8 Processing Capabilities

As data volumes continued to grow and computing needs continued to increase, hardware vendors responded by creating symmetric multi-processor (SMP) machines. SMP is a multiprocessor computer architecture in which two or more identical processors are connected to a single, shared main memory. Organizations also built networked computing environments with multiple, individual computing resources connected via networking protocols. SAS responded to these advances by developing multi-processing capabilities. In SAS/CONNECT Version 8, SAS applications could take advantage of the multi-processors available in desktop and server platforms, and could multi-process across platforms available in a network. With SAS/CONNECT you can spawn N SAS sessions or processes, simultaneously execute N tasks as independent processes, and coordinate the execution and results into the client or parent session. A major benefit of this technology is the flexibility for the multiple sessions to run on multiple CPUs within an SMP box, across multiple, distributed machines in a network, or a combination of both.

SAS Version 9 brought multi-threading capabilities to further leverage the growing adoption of SMP architectures. On hardware with more than one CPU, multi-threading provides a mechanism for a program to exploit more than one CPU simultaneously. By creating multiple, simultaneously active threads, the program enables the operating system to schedule these threads concurrently on more than one CPU. […] Although multi-threading capabilities allow an application to exploit the multiple CPUs in an SMP machine, the scalability gains are limited to a single SMP box and cannot leverage distributed computing resources.

Figure 2. SAS V9 Processing Capabilities

The initial offering of SAS Version 9 provided both multi-threading and multi-processing capabilities to allow SAS applications to scale up to take advantage of multi-processors available in SMP hardware. SAS also provided multi-processing capabilities to allow SAS applications to scale out to take advantage of any number of distributed computing resources. The next evolutionary step was for SAS to make it possible for customers to run their SAS applications in a grid or cluster environment. SAS Grid Manager was introduced in SAS 9.1.3 to build upon the parallel capabilities of SAS/CONNECT and to add the many other requirements of enterprise grid deployments. SAS Grid Manager provides multi-user load balancing, policy enforcement, efficient resource allocation, and prioritization for SAS products and solutions running in a shared grid environment.

Figure 3. SAS V9.1.3 Can Leverage Grid Processing

SAS Grid Manager has been integrated with many SAS products and solutions to provide seamless grid capabilities to the users of these applications. One such application is SAS Data Integration Studio. This paper discusses the multiple ways that SAS Data Integration Studio works with SAS Grid Manager to bring the benefits of a SAS grid infrastructure to the user in an easy-to-use, point-and-click development environment.

USING SAS GRID MANAGER IN SAS DATA INTEGRATION STUDIO

When running processes in a grid, SAS Grid Manager dynamically determines node availability and monitors grid nodes to determine which node is the best candidate to receive the next workload segment. This determination can be based on many factors, but it often considers the current load under which all grid nodes are running at any given time. The node that has the lowest CPU load becomes the best candidate on which to run the next workload segment. This dynamic capability greatly increases job runtime performance by distributing processes across a wider array of resources capable of handling the greater computing load.
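The selection rule described above (send the next workload segment to the available node with the lowest CPU load) can be illustrated with a minimal Python sketch. This is not SAS Grid Manager code: the `Node` class, the `cpu_load` metric, and the fixed `load_per_segment` increment are hypothetical stand-ins, and a real grid scheduler would weigh many more factors.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_load: float          # hypothetical metric: fraction of CPU in use, 0.0 to 1.0
    available: bool = True   # whether the node can currently accept work


def pick_node(nodes):
    """Return the available node with the lowest CPU load, or None if there is none."""
    candidates = [n for n in nodes if n.available]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.cpu_load)


def dispatch(segments, nodes, load_per_segment=0.25):
    """Assign each workload segment to the node that is least loaded at that moment."""
    assignments = []
    for segment in segments:
        node = pick_node(nodes)
        if node is None:
            raise RuntimeError("no grid node is available")
        assignments.append((segment, node.name))
        # Crude assumption: every segment adds the same amount of load.
        node.cpu_load += load_per_segment
    return assignments


if __name__ == "__main__":
    grid = [
        Node("node1", 0.70),
        Node("node2", 0.10),
        Node("node3", 0.40),
        Node("node4", 0.05, available=False),
    ]
    print(dispatch(["seg1", "seg2", "seg3"], grid))
```

Because the load is re-evaluated before every assignment, a node that was idle at the start stops receiving work once it becomes busier than its neighbors.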

There are three key ways that SAS programs and applications can exploit grid computing using SAS Grid Manager in SAS Data Integration Studio:

• Distributed Enterprise Scheduling: distribute scheduled jobs to a shared pool of resources in a grid

• Multi-user Workload Balancing: distribute user-submitted jobs interactively to a shared pool of resources in a grid

• Parallel Workload Balancing: distribute parallelized jobs, either scheduled or user submitted, to a shared pool of resources in a grid

Figure 4 below describes the SAS products that can exploit each of the features. SAS Data Integration Server can exploit all three capabilities.

Figure 4. Products in SAS That Can Exploit SAS Grid Computing

Distributed Enterprise Scheduling

Using Distributed Enterprise Scheduling with SAS Grid Manager, scheduled jobs are targeted to run on the grid. This makes all resources in the grid available to run scheduled jobs. The scheduling server manages the resources in the grid workload so that jobs are efficiently distributed across the available machines in the grid. You can leverage this capability in SAS Data Integration Studio and other SAS products by deploying the job to be scheduled and then using the scheduling server that manages the grid to schedule and run the jobs.

Figure 5. Using Schedule Manager for Grid Computing

Multi-User Workload Balancing

With Multi-User Workload Balancing, individual users have the ability to submit jobs interactively and directly to the grid. For example, a site could have a number of users that do ad hoc development, such as model development, queries, and other sorts of discovery and analysis. SAS Grid Manager provides the ability to leverage the grid when submitting jobs. Using the grid provides all of the capabilities of load balancing, such as queuing, prioritization, workload balancing, and resource management, for this type of interactive submission. High priority jobs can even preempt lower priority work so that the most critical business processes execute first. This enables users to leverage all of the available resources in their distributed environment for job processing, thereby speeding up long-running tasks and increasing user productivity.
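The queuing and prioritization behavior described above can be sketched with a small priority queue. This is an illustrative Python model only: the `GridQueue` class, the job names, and the numeric priorities are hypothetical, and the sketch orders waiting jobs but does not model preemption of work that is already running.

```python
import heapq
import itertools


class GridQueue:
    """Toy job queue: higher priority runs first, ties run in submission order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves submission order

    def submit(self, job_name, priority):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), job_name))

    def next_job(self):
        """Pop and return the name of the next job to run, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, job_name = heapq.heappop(self._heap)
        return job_name


if __name__ == "__main__":
    queue = GridQueue()
    queue.submit("ad_hoc_query", priority=10)
    queue.submit("nightly_load", priority=50)
    queue.submit("model_fit", priority=10)
    print([queue.next_job() for _ in range(3)])
```

The submission counter keeps jobs of equal priority in first-come, first-served order, which is one simple way to keep interactive users from starving each other.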

SAS Data Integration Studio 4.2 supports this capability by enabling you to select the target server where you want to submit jobs or transformation steps in a job. The target server can be the grid server. When the grid server is selected, Data Integration Studio wraps the submitted code with the appropriate statements to submit the interactive job to the grid.

Figure 6. Using SAS Data Integration Studio's Server Selection Capabilities to Submit Jobs to a Grid

SAS Grid Manager also supports the option to group similar resources together into a named partition of nodes. For example, an administrator might want to configure one set of nodes for work with analytical applications, and another set of nodes for data integration processes. The administrator can assign a name to this partition so that users can specify it when submitting their processes. This partitioning ability allows the administrator to tailor the grid to better meet the needs of the user community. Data Integration Studio allows you to specify an optional grid partition when you submit processes using Workload Balancing.
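As a rough illustration of how a named partition narrows the set of nodes a job can run on, consider the following Python sketch. The partition names, node names, and the `eligible_nodes` helper are all hypothetical; they only mirror the idea of an optional partition supplied at submission time.

```python
# Hypothetical grid layout: partition name -> nodes assigned by the administrator.
PARTITIONS = {
    "analytics": ["node1", "node2"],
    "data_integration": ["node3", "node4", "node5"],
}

ALL_NODES = ["node1", "node2", "node3", "node4", "node5"]


def eligible_nodes(all_nodes, partitions, partition=None):
    """Return the nodes a job may use; with no partition, the whole grid is eligible."""
    if partition is None:
        return list(all_nodes)
    if partition not in partitions:
        raise KeyError(f"unknown grid partition: {partition!r}")
    members = set(partitions[partition])
    return [node for node in all_nodes if node in members]


if __name__ == "__main__":
    print(eligible_nodes(ALL_NODES, PARTITIONS))                      # whole grid
    print(eligible_nodes(ALL_NODES, PARTITIONS, "data_integration"))  # one partition
```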

Figure 7. Data Integration Studio Supports Grid Partitions

Parallel Workload Balancing

SAS Data Integration Studio also supports the ability to parallelize processes, which can be submitted interactively using Parallel Workload Balancing or can be scheduled as job flows using Distributed Enterprise Scheduling to run in parallel on a grid. Parallel execution of job flows is supported using iteration, a scenario frequently found in Data Integration processing. Iteration can be explained with a simple example.

Sometimes it is desirable to execute the same process flow over and over again on different data. For example, suppose you have United States Census data as a set of 50 tables, one for each state. The table structure is identical, but the data is specific to each state:

• HouseholdsCA

• HouseholdsTX

• HouseholdsAZ

• HouseholdsNM

…additional tables

Now suppose you want to calculate the number of households that own more than one acre of land. You would run the same process on the source table for every state. One way to process this data would be to run the process serially, one table at a time. However, if you have the computing resources that a grid makes available, you could run the same process in parallel on the different source tables. This is iteration: the same process flow is run once for each input. The scope of the iteration is called a loop. SAS Data Integration Studio includes a methodology for handling looping for jobs, both serially and in parallel. Each run is called an iteration of the job, and it can be submitted to run in parallel, either on a single machine or on a grid to leverage the multiple computing nodes available there.
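The difference between running the loop serially and in parallel can be sketched in Python. This is only an analogy for the Loop transform, not SAS code: the in-memory `TABLES` data and the `count_large_lots` function are hypothetical, and a local thread pool stands in for the grid nodes that would each run one iteration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the per-state tables; each row is (household_id, acres_owned).
TABLES = {
    "HouseholdsCA": [(1, 0.2), (2, 1.5), (3, 3.0)],
    "HouseholdsTX": [(1, 2.0), (2, 0.5)],
    "HouseholdsAZ": [(1, 0.9)],
    "HouseholdsNM": [(1, 1.1), (2, 4.0)],
}


def count_large_lots(table_name):
    """One iteration of the loop: count households owning more than one acre."""
    rows = TABLES[table_name]
    return table_name, sum(1 for _, acres in rows if acres > 1.0)


def run_serial(table_names):
    """Run the same process on each table, one table at a time."""
    return dict(count_large_lots(name) for name in table_names)


def run_parallel(table_names, max_workers=4):
    """Run the same process on every table concurrently, one worker per iteration."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(count_large_lots, table_names))


if __name__ == "__main__":
    names = list(TABLES)
    print(run_serial(names))
    print(run_parallel(names))
```

Both functions produce the same results because each iteration depends only on its own input table, which is the property that makes the loop safe to parallelize.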

Figure 8. SAS Data Integration Studio Job That Supports Parallel Processing of Data, One Process per Loop

[…] configuration so that job submissions are sent only to the right set of available servers, sign-on retries, and other features.

Figure 9. Options Available in the Loop Transform for Iteration

PERFORMANCE CONSIDERATIONS

It is useful to understand how a grid and parallelization can improve job runtime. In the census example, some states have more households than others, so the runtime of each job element varies by state. As a baseline, we benchmarked the runtime for each state during sequential execution in a test environment of 10 blade servers with 60GB of data, which produced the following workload profile.

Figure 10. Runtimes for Processing Census Data by State

Running these jobs serially took around 600 minutes of real time on our test environment. We then ran the same jobs in parallel using the Loop transform. We were able to achieve a best-case runtime in the test environment of approximately 100 minutes, a 6x performance gain over the serial case.
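The arithmetic behind these figures can be checked with a short Python sketch. The speedup uses the two runtimes reported above; the efficiency figure assumes, hypothetically, that all 10 blade servers were used for the parallel run, and the per-state runtimes in the last example are invented to show why uneven job sizes limit the gain.

```python
def speedup(serial_minutes, parallel_minutes):
    """Ratio of serial runtime to parallel runtime."""
    return serial_minutes / parallel_minutes


def efficiency(serial_minutes, parallel_minutes, workers):
    """Speedup divided by the number of workers (1.0 would be ideal scaling)."""
    return speedup(serial_minutes, parallel_minutes) / workers


def parallel_lower_bound(runtimes, workers):
    """Best possible wall-clock time when each job must run whole on one worker.

    The run can finish no sooner than the longest single job, and no sooner than
    the total work divided evenly across the workers.
    """
    return max(max(runtimes), sum(runtimes) / workers)


if __name__ == "__main__":
    print(speedup(600, 100))          # runtimes reported in the text
    print(efficiency(600, 100, 10))   # assumes all 10 blades were in use
    # Hypothetical per-state runtimes in minutes: one large state dominates.
    print(parallel_lower_bound([100, 50, 50], workers=10))
```

Under these assumptions, adding more nodes cannot push the parallel runtime below the runtime of the single longest state, which is one plausible reason a best case of about 100 minutes falls short of a 10x gain.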

SAS has published a number of grid performance benchmarks for different scenarios; see the reference list at the end of this paper for details. Benchmark tests are available for a variety of customer-usage patterns, including a large, multi-user, ad hoc analytics environment and an I/O intensive scenario. These scenarios varied the number of available grid nodes and I/O capabilities to determine performance patterns, with some encouraging results. In the computational scenario, the benchmarks were able to achieve linear scalability as workload increased simply by increasing the available computer resources, that is, adding nodes to the grid. Similarly, with I/O intensive processes, increasing the available I/O resources enabled the processes to scale linearly in the grid environment.


