FREE ELECTRONIC LIBRARY - Books, dissertations, abstract

Pages:   || 2 | 3 | 4 | 5 |   ...   | 11 |

«MapReduce is a framework for processing and managing large scale data sets in a distributed cluster, which has been used for applications such as ...»

-- [ Page 1 ] --


Distributed Data Management Using MapReduce

FENG LI, National University of Singapore

BENG CHIN OOI, National University of Singapore

M. TAMER ÖZSU, University of Waterloo

SAI WU, Zhejiang University

MapReduce is a framework for processing and managing large scale data sets in a distributed cluster, which

has been used for applications such as generating search indexes, document clustering, access log analysis,

and various other forms of data analytics. MapReduce adopts a flexible computation model with a simple interface consisting of map and reduce functions whose implementations can be customized by application developers. Since its introduction, a substantial amount of research efforts have been directed towards making it more usable and efficient for supporting database-centric operations. In this paper we aim to provide a comprehensive review of a wide range of proposals and systems that focusing fundamentally on the support of distributed data management and processing using the MapReduce framework.

ACM Reference Format:

Li, F., Ooi, B-C., Özsu, M. T., Wu, S. 2013. Distributed Data Management Using MapReduce. ACM Comput.

Surv. 0, 0, Article A ( 0), 41 pages.

DOI = 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000

1. INTRODUCTION Database management systems (DBMSs) have become a ubiquitous operational platform in managing huge amounts of business data. DBMSs have evolved over the last four decades and are now functionally rich. While enterprises are struggling with the problem of poor database scalability, a new challenge has emerged that has impacted the IT infrastructures of many modern enterprises. DBMSs have been criticized for their monolithic architecture that is claimed to make them “heavyweight” and expensive to operate. It is sometimes argued that they are not efficient for many data management tasks despite their success in business data processing. This challenge has been labeled as the big data problem. In principle, while earlier DBMSs focused on modeling operational characteristics of enterprises, big data systems are now expected to model user behaviors by analyzing vast amounts of user interaction logs. There have been various proposals to restructure DBMSs (e.g., [Chaudhuri and Weikum 2000;

Stonebraker et al. 2007]), but the basic architecture has not changed dramatically.

With the increasing amount of data and the availability of high performance and relatively low-cost hardware, database systems have been extended and parallelized to run on multiple hardware platforms to manage scalability [Özsu and Valduriez 2011].

Recently, a new distributed data processing framework called MapReduce was proposed [Dean and Ghemawat 2004] whose fundamental idea is to simplify the parallel processing using a distributed computing platform that offers only two interfaces:

Authors’ addresses: F. Li and B-C. Ooi, School of Computing, National University of Singapore, Singapore;

M. T. Özsu, Cheriton School of Computer Science, University of Waterloo, Canada; S. Wu, Department of Computer Science, Zhejiang University, China.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.

c 0 ACM 0360-0300/0/-ARTA $15.00 DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000 ACM Computing Surveys, Vol. 0, No. 0, Article A, Publication date: 0.

A:2 Feng Li et al.

map and reduce. Programmers implement their own map and reduce functions, while the system is responsible for scheduling and synchronizing the map and reduce tasks.

MapReduce model can be used to solve the “embarrassingly parallel” problems1, where little or no effort is required to partition a task into a number of parallel but smaller tasks. MapReduce is being used increasingly in applications such as data mining, data analytics and scientific computation. Its wide adoption and success lies in its distinguishing features, which can be summarized as follows.

(1) Flexibility. Since the code for map and reduce functions are written by the user, there is considerable flexibility in specifying the exact processing that is required over the data rather than specifying it using SQL. Programmers can write simple map and reduce functions to process petabytes of data on thousands of machines without the knowledge of how to parallelize the processing of a MapReduce job.

(2) Scalability. A major challenge in many existing applications is to be able to scale to increasing data volumes. In particular, elastic scalability is desired, which requires the system to be able to scale its performance up and down dynamically as the computation requirements change. Such a “pay-as-you-go” service model is now widely adopted by the cloud computing service providers, and MapReduce can support it seamlessly through data parallel execution. MapReduce was successfully deployed on thousands of nodes and able to handle petabytes of data.

(3) Efficiency. MapReduce does not need to load data into a database, which typically incurs high cost. It is, therefore, very efficient for applications that require processing the data only once (or only a few times).

(4) Fault tolerance. In MapReduce, each job is divided into many small tasks that are assigned to different machines. Failure of a task or a machine is compensated by assigning the task to a machine that is able to handle the load. The input of a job is stored in a distributed file system where multiple replicas are kept to ensure high availability. Thus, the failed map task can be repeated correctly by reloading the replica. The failed reduce task can also be repeated by re-pulling the data from the completed map tasks.

The criticisms of MapReduce center on its reduced functionality, requiring considerable amount of programming effort, and its unsuitability for certain types of applications (e.g., those that require iterative computations) [DeWitt et al. 2008; Dewitt and Stonebraker 2009; Pavlo et al. 2009; Stonebraker et al. 2010]. MapReduce does not require the existence of a schema and does not provide a high-level language such as SQL. The flexibility advantage mentioned above comes at the expense of considerable (and usually sophisticated) programming on the part of the user. Consequently, a job that can be performed using relatively simple SQL commands may require considerable amount of programming in MapReduce, and this code is generally not reusable. Moreover, MapReduce does not have built-in indexing and query optimization support, always resorting to scans. The potential performance drawback of MapReduce has been reported [Pavlo et al. 2009] on the basis of experiments on two benchmarks – TPC-H and a customized benchmark tailored for search engines. In a 100-node cluster, a parallel database system and a column-wise data manage system called Vertica (http://www.vertica.com) show superior performance than Hadoop (http://hadoop.apache.org/) implementation of MapReduce for various workloads, including simple grep, join and aggregation jobs.

Since the introduction of MapReduce, there have been a long stream of research that attempt to address the problems highlighted above, and this indeed remains an active area of research. Considerable effort has been spent on efficient implementation of the 1 http://parlab.eecs.berkeley.edu/wiki/media/patterns/map-reduce-pattern.doc

–  –  –

MapReduce framework. There have been proposals for more sophisticated scheduling algorithms [Zaharia et al. 2008] and parsing schemes [Jiang et al. 2010] to improve performance. There have also been a number of works to extend the framework to support more complex applications [Condie et al. 2010; Bu et al. 2010]. High level declarative (Hive [Thusoo et al. 2009] and Pig [Olston et al. 2008]), and procedural languages (Sawzall [Pike et al. 2005]) as well as a Java library (FlumeJava [Chambers et al. 2010]), have also been proposed for the MapReduce framework to improve its ease of use (Section 3.2).

The focus of this survey is on large-scale data processing using MapReduce. The ease with which MapReduce programs can be parallelized has caused the migration of a number of applications to the MapReduce platform. There are two aspects of supporting DBMS functionality over MapReduce. The first aspect is the implementation of database operators, such as select, project, etc, as MapReduce jobs, and specific indexes [Dittrich et al. 2010] to support these implementations (Section 4).

The second aspect is to combine these implementations to create a fully-functional DBMS/data warehouse on MapReduce (Section 5). Example MapReduce-based DBMS implementations include HadoopDB [Abouzeid et al. 2009], Llama [Lin et al. 2011], MRShare [Nykiel et al. 2010] and Cheetah [Chen 2010]. In addition, traditional database systems sometimes provide MapReduce as a built-in feature (e.g., Greenplum and Aster) [Friedman et al. 2009].

Within this context, the objectives of this survey are four-fold. First we introduce this new and increasingly widely deployed distributed computing paradigm (Section 2) and its current implementations (Section 3). Second, we present the current research on enhancing MapReduce to better address modern data intensive applications without losing its fundamental advantages (Sections 4 and 5). Third, we discuss ongoing work in extending MapReduce to handle a richer set of workloads such as streaming data, iterative computations (Section 6). Finally, we briefly review a number of recent systems that may have been influenced by MapReduce (Section 7). We assume that the readers are familiar with basic data management terminology and concepts.


2.1. MapReduce Programming Model MapReduce is a simplified parallel data processing approach for execution on a computer cluster [Dean and Ghemawat 2004]. Its programming model consists of two user defined functions, map and reduce2 (Table I).

The inputs of the map function is a set of key/value pairs. When a MapReduce job is submitted to the system, the map tasks (which are processes that are referred to as mappers) are started on the compute nodes and each map task applies the map function to every key/value pair (k1, v1) that is allocated to it. Zero or more intermediate key/value pairs (list(k2, v2)) can be generated for the same input key/value pair. These 2 Asa convention, we will use courier font when we refer to the specific function or interface, and the regular font when we refer to the processing of the corresponding function.

–  –  –

ALGORITHM 1: Map Function for UserVisits input: String key, String value String[] array = value.split(”|”);


ALGORITHM 2: Reduce Function for UserVisits input: String key, Iterator values float totalRevenue = 0;

while values.hasNext() do totalRevenue += values.next();

end Emit(key, totalRevenue);

intermediate results are stored in the local file system and sorted by the keys. After all the map tasks complete, the MapReduce engine notifies the reduce tasks (which are also processes that are referred to as reducers) to start their processing. The reducers will pull the output files from the map tasks in parallel, and merge-sort the files obtained from the map tasks to combine the key/value pairs into a set of new key/value pair (k2, list(v2)), where all values with the same key k2 are grouped into a list and used as the input for the reduce function. The reduce function applies the user-defined processing logic to process the data. The results, normally a list of values, are written back to the storage system.

Pages:   || 2 | 3 | 4 | 5 |   ...   | 11 |

Similar works:

«Export in das PDF-Dateiformat [Export in das PDF-Dateiformat] [0.2] Erste Version: [2007-01-16] Erste Deutsche Version: [16.01.2007] Export in das PDF-Dateiformat Export in das PDF-Dateiformat Inhaltsverzeichnis Einleitung Die verschiedenen Wege zum Export Direkter Export in das PDF-Format Export über das Menü Datei in das PDF-Format Register Allgemein Register Anfangsdarstellung Register Benutzerschnittstelleoberfläche Register Sicherheit Register „Verknüpfungen“ Ändern der Funktion...»

«Oral Tradition, 9/2 (1994): 255-280 Informing Performance: Producing the Coloquio in Tierra Blanca Richard Bauman and Pamela Ritch The Albert Lord and Milman Parry Lecture for 19921 The attractions of performance as a focus of anthropological and folkloristic analysis are many, but in general terms they derive from four characteristic properties of performance: 1. performances are artful, the locus of aesthetic behaviors, forms, responses, and values, as enacted in social life; 2. performances...»

«Assistance Work in the Human Resources Group, GS-0200 December 2000 Job Family Position Classification Standard for Assistance Work in the Human Resources Management Group, GS-0200 Series Covered by This Standard: Human Resources Assistance, GS-0203 Table of Contents INTRODUCTION COVERAGE MODIFICATIONS AND CANCELLATIONS TO OTHER EXISTING OCCUPATIONAL SERIES AND STANDARDS GENERAL SERIES, TITLING, AND OCCUPATIONAL GUIDANCE HUMAN RESOURCES ASSISTANCE, GS-0203 DISTINCTIONS BETWEEN ASSISTANT WORK...»

«The Catholic Encyclopedia, Volume 2: AssizesBrowne Author(s): Herbermann, Charles George (1840-1916) Publisher: Subjects: Christian Denominations Roman Catholic Church Dictionaries. Encyclopedias i Contents Assizes to Baal 1 Baal to Browne 453 Indexes 2088 Subject Index 2089 Index of Scripture References 2090 ii This PDF file is from the Christian Classics Ethereal Library, www.ccel.org. The mission of the CCEL is to make classic Christian books available to the world. • This book is...»

«VON ALLGEMEINEN THEORIEN DER GEDÄCHTNISENTWICKLUNG ZUR ANALYSE SPEZIFISCHER LERNUND ERINNERUNGSVORGÄNGE Weinert, Franz E•• Knopf. Monika und Schneider, Wolfgang Schon wenige Jahre nach Beginn der experimentellen Lernforschung Ende des vorigen Jahrhunderts schienen die wissenschaftlichen Probleme der Gedächtnisentwicklung im Verlauf des menschlichen Lebens gelöst zu sein. 1929 konstruierte Thorndike eine allgemeine Entwicklungskurve zur Beschreibung des modalen Zusammenhangs zwischen...»

«Das Verhältnis von Christentum und Hinduismus im heutigen indisch-christlichen theologischen Denken. Ein Beispiel: Raimundo Panikkar* I. Raimundo Panikkar ist der heutzutage wohl bekannteste indischchristliche Theologe. Er wurde im Jahr 1918 in Spanien geboren. Sein Vater war ein Hindu und seine Mutter eine Spanierin. So war die Frage nach seiner Identität für ihn frühzeitig auch eine wichtige theologische Frage. Ohne diesen biographischen Hintergrund ist es schwierig, seine theologischen...»

«Every finite semigroup is embeddable in a finite relatively free semigroup George M. Bergman Department of Mathematics, University of California, Berkeley, CA 94720-3840, U.S.A. Abstract The title result is proved by a Murskii-type embedding. Results on some related questions are also obtained. For instance, it is shown that every finitely generated semigroup satisfying an identity ξ d = ξ 2d is embeddable in a relatively free semigroup satisfying such an identity, generally with a larger...»

«Dissertation zur Erlangung des Doktorgrades der Fakultät für Chemie und Pharmazie der Ludwig-Maximilians-Universität München Molecular basis of RNA polymerase III transcription repression by Maf1 & Structure of human mitochondrial RNA polymerase Eva Rieke Ringel aus Essen Dissertation zur Erlangung des Doktorgrades der Fakultät für Chemie und Pharmazie der Ludwig-Maximilians-Universität München Molecular basis of RNA polymerase III transcription repression by Maf1 & Structure of human...»

«Emergence of new signal-primitives in neural systems Peter CARIANI Intellectica, 1997/2: 95-143. (Pagination differs from the original.) Eaton Peabody Laboratory for Auditory Physiology Massachusetts Eye & Ear Infirmary, 243 Charles St, Boston MA 02114 USA www.cariani.com Abstract Emergence is the process by which new structures and functions come into being. There are two fundamental, but complementary, conceptions of emergence: combinatoric emergence, wherein novelty arises by new...»

«Pacific University WORK-STUDY PROGRAM 2011-2012 Student Participation Manual Pacific University Career Development Center 2043 College Way, Chapman Hall Forest Grove, OR 97116 INTRODUCTION The Pacific University Work-Study Student Employment Program is administered by the Career Development Center. Every effort is made to unite the educational goals of Work-Study with our Career Center mission of collaborating with fellow educators and community members to provide a transformative blend of...»

<<  HOME   |    CONTACTS
2016 www.book.dislib.info - Free e-library - Books, dissertations, abstract

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.