Distributed Data Management Using MapReduce
FENG LI, National University of Singapore
BENG CHIN OOI, National University of Singapore
M. TAMER ÖZSU, University of Waterloo
SAI WU, Zhejiang University
MapReduce is a framework for processing and managing large scale data sets in a distributed cluster, which
has been used for applications such as generating search indexes, document clustering, access log analysis,
and various other forms of data analytics. MapReduce adopts a flexible computation model with a simple interface consisting of map and reduce functions whose implementations can be customized by application developers. Since its introduction, a substantial amount of research effort has been directed towards making it more usable and efficient for supporting database-centric operations. In this paper, we aim to provide a comprehensive review of a wide range of proposals and systems that focus fundamentally on the support of distributed data management and processing using the MapReduce framework.
ACM Reference Format:
Li, F., Ooi, B-C., Özsu, M. T., Wu, S. 2013. Distributed Data Management Using MapReduce. ACM Comput.
Surv. 0, 0, Article A ( 0), 41 pages.
DOI = 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000
1. INTRODUCTION
Database management systems (DBMSs) have become a ubiquitous operational platform in managing huge amounts of business data. DBMSs have evolved over the last four decades and are now functionally rich. While enterprises are struggling with the problem of poor database scalability, a new challenge has emerged that has impacted the IT infrastructures of many modern enterprises. DBMSs have been criticized for their monolithic architecture that is claimed to make them "heavyweight" and expensive to operate. It is sometimes argued that they are not efficient for many data management tasks despite their success in business data processing. This challenge has been labeled as the big data problem. In principle, while earlier DBMSs focused on modeling operational characteristics of enterprises, big data systems are now expected to model user behaviors by analyzing vast amounts of user interaction logs. There have been various proposals to restructure DBMSs (e.g., [Chaudhuri and Weikum 2000;
Stonebraker et al. 2007]), but the basic architecture has not changed dramatically.
With the increasing amount of data and the availability of high performance and relatively low-cost hardware, database systems have been extended and parallelized to run on multiple hardware platforms to manage scalability [Özsu and Valduriez 2011].
Recently, a new distributed data processing framework called MapReduce was proposed [Dean and Ghemawat 2004] whose fundamental idea is to simplify the parallel processing using a distributed computing platform that offers only two interfaces:
Authors’ addresses: F. Li and B-C. Ooi, School of Computing, National University of Singapore, Singapore;
M. T. Özsu, Cheriton School of Computer Science, University of Waterloo, Canada; S. Wu, Department of Computer Science, Zhejiang University, China.
map and reduce. Programmers implement their own map and reduce functions, while the system is responsible for scheduling and synchronizing the map and reduce tasks.
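To make this contract concrete, the canonical word-count job can be sketched in Python. The function names and the in-process grouping step are illustrative assumptions; in an actual deployment the framework distributes the grouping ("shuffle") across machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

# map: consumes one input key/value pair, emits intermediate key/value pairs.
def word_count_map(key: str, value: str) -> Iterator[Tuple[str, int]]:
    for word in value.split():
        yield (word, 1)

# reduce: consumes one key and all intermediate values grouped under it.
def word_count_reduce(key: str, values: Iterable[int]) -> Tuple[str, int]:
    return (key, sum(values))

def run_job(documents: dict) -> dict:
    # Simulated shuffle: group intermediate values by key in-process.
    # The framework would perform this step across the cluster.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for k, v in word_count_map(doc_id, text):
            groups[k].append(v)
    return dict(word_count_reduce(k, vs) for k, vs in groups.items())

print(run_job({"d1": "to be or not to be"}))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The user writes only the two functions; scheduling, grouping, and synchronization are the framework's responsibility.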
The MapReduce model can be used to solve "embarrassingly parallel" problems1, where little or no effort is required to partition a task into a number of parallel but smaller tasks. MapReduce is being used increasingly in applications such as data mining, data analytics, and scientific computation. Its wide adoption and success lie in its distinguishing features, which can be summarized as follows.
(1) Flexibility. Since the code for the map and reduce functions is written by the user, there is considerable flexibility in specifying the exact processing that is required over the data, rather than specifying it using SQL. Programmers can write simple map and reduce functions to process petabytes of data on thousands of machines without knowledge of how to parallelize the processing of a MapReduce job.
(2) Scalability. A major challenge in many existing applications is being able to scale to increasing data volumes. In particular, elastic scalability is desired, which requires the system to be able to scale its performance up and down dynamically as the computation requirements change. Such a "pay-as-you-go" service model is now widely adopted by cloud computing service providers, and MapReduce can support it seamlessly through data-parallel execution. MapReduce has been successfully deployed on thousands of nodes and is able to handle petabytes of data.
(3) Efﬁciency. MapReduce does not need to load data into a database, which typically incurs high cost. It is, therefore, very efﬁcient for applications that require processing the data only once (or only a few times).
(4) Fault tolerance. In MapReduce, each job is divided into many small tasks that are assigned to different machines. The failure of a task or a machine is compensated for by assigning the task to a machine that is able to handle the load. The input of a job is stored in a distributed file system in which multiple replicas are kept to ensure high availability. Thus, a failed map task can be repeated correctly by reloading the replica. A failed reduce task can also be repeated by re-pulling the data from the completed map tasks.
The criticisms of MapReduce center on its reduced functionality, its requirement of a considerable amount of programming effort, and its unsuitability for certain types of applications (e.g., those that require iterative computations) [DeWitt et al. 2008; Dewitt and Stonebraker 2009; Pavlo et al. 2009; Stonebraker et al. 2010]. MapReduce does not require the existence of a schema and does not provide a high-level language such as SQL. The flexibility advantage mentioned above comes at the expense of considerable (and usually sophisticated) programming on the part of the user. Consequently, a job that can be performed using relatively simple SQL commands may require a considerable amount of programming in MapReduce, and this code is generally not reusable. Moreover, MapReduce does not have built-in indexing and query optimization support, always resorting to scans. The potential performance drawback of MapReduce has been reported [Pavlo et al. 2009] on the basis of experiments on two benchmarks: TPC-H and a customized benchmark tailored for search engines. In a 100-node cluster, a parallel database system and a column-oriented data management system called Vertica (http://www.vertica.com) showed superior performance to the Hadoop (http://hadoop.apache.org/) implementation of MapReduce for various workloads, including simple grep, join, and aggregation jobs.
1 http://parlab.eecs.berkeley.edu/wiki/media/patterns/map-reduce-pattern.doc
Since the introduction of MapReduce, there has been a long stream of research attempting to address the problems highlighted above, and this indeed remains an active area of research. Considerable effort has been spent on efficient implementation of the
MapReduce framework. There have been proposals for more sophisticated scheduling algorithms [Zaharia et al. 2008] and parsing schemes [Jiang et al. 2010] to improve performance. There have also been a number of works to extend the framework to support more complex applications [Condie et al. 2010; Bu et al. 2010]. High level declarative (Hive [Thusoo et al. 2009] and Pig [Olston et al. 2008]), and procedural languages (Sawzall [Pike et al. 2005]) as well as a Java library (FlumeJava [Chambers et al. 2010]), have also been proposed for the MapReduce framework to improve its ease of use (Section 3.2).
The focus of this survey is on large-scale data processing using MapReduce. The ease with which MapReduce programs can be parallelized has caused the migration of a number of applications to the MapReduce platform. There are two aspects of supporting DBMS functionality over MapReduce. The first aspect is the implementation of database operators, such as select and project, as MapReduce jobs, and of specific indexes [Dittrich et al. 2010] to support these implementations (Section 4).
The second aspect is to combine these implementations to create a fully-functional DBMS/data warehouse on MapReduce (Section 5). Example MapReduce-based DBMS implementations include HadoopDB [Abouzeid et al. 2009], Llama [Lin et al. 2011], MRShare [Nykiel et al. 2010] and Cheetah [Chen 2010]. In addition, traditional database systems sometimes provide MapReduce as a built-in feature (e.g., Greenplum and Aster) [Friedman et al. 2009].
Within this context, the objectives of this survey are four-fold. First, we introduce this new and increasingly widely deployed distributed computing paradigm (Section 2) and its current implementations (Section 3). Second, we present the current research on enhancing MapReduce to better address modern data-intensive applications without losing its fundamental advantages (Sections 4 and 5). Third, we discuss ongoing work in extending MapReduce to handle a richer set of workloads, such as streaming data and iterative computations (Section 6). Finally, we briefly review a number of recent systems that may have been influenced by MapReduce (Section 7). We assume that the readers are familiar with basic data management terminology and concepts.
2. MAPREDUCE TECHNOLOGY
2.1. MapReduce Programming Model
MapReduce is a simplified parallel data processing approach for execution on a computer cluster [Dean and Ghemawat 2004]. Its programming model consists of two user-defined functions, map and reduce2 (Table I).
The input of the map function is a set of key/value pairs. When a MapReduce job is submitted to the system, the map tasks (which are processes referred to as mappers) are started on the compute nodes, and each map task applies the map function to every key/value pair (k1, v1) that is allocated to it. Zero or more intermediate key/value pairs (list(k2, v2)) can be generated for the same input key/value pair. These
2 As a convention, we will use courier font when we refer to the specific function or interface, and the regular font when we refer to the processing of the corresponding function.
ALGORITHM 1: Map Function for UserVisits
input: String key, String value
String array = value.split("|");

ALGORITHM 2: Reduce Function for UserVisits
input: String key, Iterator values
float totalRevenue = 0;
while values.hasNext() do
    totalRevenue += values.next();
end
Emit(key, totalRevenue);
intermediate results are stored in the local file system and sorted by their keys. After all the map tasks complete, the MapReduce engine notifies the reduce tasks (which are also processes, referred to as reducers) to start their processing. The reducers pull the output files from the map tasks in parallel and merge-sort them to combine the key/value pairs into a set of new key/value pairs (k2, list(v2)), where all values with the same key k2 are grouped into a list and used as the input for the reduce function. The reduce function applies the user-defined processing logic to the data. The results, normally a list of values, are written back to the storage system.
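The end-to-end dataflow just described (map, partition, sort, merge by key, reduce) can be simulated in a short Python sketch for the UserVisits revenue aggregation. The field layout is an assumption for illustration only (field 0 as sourceIP, field 1 as adRevenue); the actual schema is not shown here.

```python
from itertools import groupby
from operator import itemgetter

# Map: parse one '|'-delimited UserVisits record.
# Assumed layout: field 0 = sourceIP, field 1 = adRevenue (illustrative).
def visits_map(key, value):
    fields = value.split("|")
    yield (fields[0], float(fields[1]))

# Reduce: sum all revenue values observed for one sourceIP.
def visits_reduce(key, values):
    yield (key, sum(values))

def run_mapreduce(records, mapper, reducer, n_reducers=2):
    # 1. Mappers partition their intermediate output by hash(key),
    #    so all values for a key land at the same reducer.
    partitions = [[] for _ in range(n_reducers)]
    for k, v in records:
        for ik, iv in mapper(k, v):
            partitions[hash(ik) % n_reducers].append((ik, iv))
    # 2. Each reducer sorts its partition so equal keys become adjacent,
    #    then applies reduce to each (key, list of values) group.
    results = []
    for part in partitions:
        part.sort(key=itemgetter(0))
        for k, group in groupby(part, key=itemgetter(0)):
            results.extend(reducer(k, [v for _, v in group]))
    return sorted(results)

rows = [(None, "1.2.3.4|0.50"), (None, "5.6.7.8|1.25"), (None, "1.2.3.4|0.25")]
print(run_mapreduce(rows, visits_map, visits_reduce))
# [('1.2.3.4', 0.75), ('5.6.7.8', 1.25)]
```

The sort-then-group step stands in for the merge-sort that real reducers perform on the files pulled from mappers; the hash partitioning mirrors how the framework routes intermediate keys to reducers.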