
Copyright © National Academy of Sciences. All rights reserved.
The Future of Computing Performance: Game Over or Next Level?
THE END OF PROGRAMMING AS WE KNOW IT 125
MapReduce: Exploiting Data Parallelism and Distributed Computation
MapReduce is a data-processing infrastructure[19] developed internally
by Google and later popularized in the Hadoop open-source version.[20]
MapReduce is targeted at large-scale data-parallel workloads in which
the input is represented as a set of key-value pairs and computation is
expressed as a sequence of two user-provided functions: map and reduce.
The map function processes the input to create intermediate key-value
pairs. Intermediate pairs with the same key are aggregated by the system
and fed to the reduce function, which produces the final output.
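To make that dataflow concrete, the following is a minimal sequential sketch in Python, using the classic word-count example. The names (map_fn, reduce_fn, map_reduce) are illustrative only and are not part of the MapReduce or Hadoop APIs:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit one intermediate (key, value) pair per word.
    for word in text.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: combine all values that share the same key.
    return key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group intermediate pairs by key.
    groups = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    # Reduce phase: one call per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = [("d1", "to be or not to be"), ("d2", "be quick")]
counts = map_reduce(docs, map_fn, reduce_fn)
# counts == {"be": 3, "not": 1, "or": 1, "quick": 1, "to": 2}
```

Everything outside map_fn and reduce_fn here stands in for the runtime; in a real deployment that runtime would execute the map and reduce calls across many machines.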
What makes MapReduce particularly compelling is that it frees the
programmer from much of the complexity of running large-scale
computations. The programmer needs only to supply the bodies of the two
MapReduce functions, and the system takes care of parallelization, data
distribution, fault tolerance, and load-balancing.
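That division of labor can be sketched with a toy runtime built on Python's multiprocessing module: the programmer writes only the map body (and a summing reduce), while the runtime decides how to fan the map tasks out across worker processes. All names here are hypothetical, and a real runtime would additionally handle data distribution, fault tolerance, and load-balancing:

```python
from collections import defaultdict
from multiprocessing import Pool

def map_words(item):
    # User-supplied map body; the runtime, not the user, decides where it runs.
    doc_id, text = item
    return [(w, 1) for w in text.split()]

def map_reduce_parallel(inputs, workers=2):
    # Toy runtime: it owns parallelization and the shuffle; the programmer
    # supplied only map_words and the summing reduce below.
    with Pool(workers) as pool:
        mapped = pool.map(map_words, inputs)  # map tasks run in parallel
    groups = defaultdict(list)
    for pairs in mapped:
        for k, v in pairs:
            groups[k].append(v)
    return {k: sum(vs) for k, vs in groups.items()}  # reduce phase

if __name__ == "__main__":
    docs = [("d1", "game over"), ("d2", "next level"), ("d3", "game on")]
    print(map_reduce_parallel(docs))
```

The same user functions would run unchanged whether the runtime used two local processes, as here, or thousands of machines.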
The class of problems that can be mapped to this relatively simple
processing framework is surprisingly broad. MapReduce was conceived
to simplify the data processing involved in creating a Web index from
a set of crawled documents, but developers have also used it for large-scale
graph algorithms (for example, finding citation occurrences among
scientific articles), for computing query-term popularity (for example,
Google Trends[21]), for creating language models for statistical machine
translation (for example, finding and keeping a count for every unique
sequence of five or more terms in a large corpus), and for other
applications. Within Google itself, MapReduce’s popularity has exploded
since its introduction.
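As an illustration of the language-model use, a map function for five-gram counting might look like the following sketch (the names are hypothetical); the reduce step is the same per-key summation used for word counting:

```python
def map_five_grams(doc_id, text):
    # Map: emit (five-gram, 1) for every five-word window in the document.
    words = text.split()
    for i in range(len(words) - 4):
        yield " ".join(words[i:i + 5]), 1

def reduce_count(key, values):
    # Reduce: total occurrences of each unique five-gram across the corpus.
    return key, sum(values)
```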
MapReduce has also been found to be useful for systems much smaller
than the large distributed clusters for which it was designed. Ranger et al.
examined the adequacy of the MapReduce framework for multicore and
multiprocessor systems[22] and found that it was equally compelling as a
programming system for this class of machines. In a set of eight
applications both coded with a version of MapReduce (the Phoenix
runtime) and hand-coded directly, the authors found that MapReduce-coded ver-
[19] See Jeffrey Dean and Sanjay Ghemawat, 2008, MapReduce: Simplified data processing on
large clusters, Communications of the ACM 51(1): 107-113, and Michael Noth, 2006, Building
a computing system for the world’s information, Invited talk, University of Iowa, IT Tech
Forum Presentations, April 20, 2006.
[20] See The Apache Hadoop project, available at http://hadoop.apache.org.
[21] See Google Trends, available at http://trends.google.com.
[22] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis,
2007, Evaluating MapReduce for multi-core and multiprocessor systems, Proceedings of the
IEEE 13th International Symposium on High-Performance Computer Architecture, Phoenix,
Ariz., February 10-14, 2007.