Internet applications) has been important in the steady increase in available parallelism for these sorts of applications.
A large and growing collection of applications lies between those extremes. In these applications, there is parallelism to be exploited, but it is not easy to extract: it is less regular and less structured in its spatial and temporal control and its data-access and communication patterns. One might argue that these have been the focus of the high-performance computing (HPC) research community for many decades and thus are well understood with respect to those aspects that are amenable to parallel decomposition. The research community also knows that algorithms best suited for a serial machine (for example, quicksort, simplex, and gaston) differ from their counterparts that are best suited for parallel machines
BOX 5.1
React, But Don’t Overreact, to Parallelism
As this report makes clear, software and hardware researchers and practitioners should address important concerns regarding parallelism. At such critical junctures, enthusiasm seems to dictate that all talents and resources be applied to the crisis at hand. Taking the longer view, however, a prudent research portfolio must include concomitant efforts to advance all systems aspects, lest they become tomorrow’s bottlenecks or crises.
For example, in the rush to innovate on chip multiprocessors (CMPs), it is tempting to ignore sequential core performance and to deploy many simple cores. That approach may prevail, but history and Amdahl’s law suggest caution. Three decades ago, a hot technology was vectors. Pioneering vector machines, such as the Control Data STAR-100 and Texas Instruments ASC, advanced vector technology without great concern for improving other aspects of computation. Seymour Cray, in contrast, designed the Cray-1¹ to have great vector performance as well as to be the world’s fastest scalar computer. Ultimately, his approach prevailed, and the early machines faded away.
Moreover, Amdahl’s law raises concern.² Amdahl’s limit argument assumed that a fraction, P, of software execution is infinitely parallelizable without overhead, whereas the remaining fraction, 1 - P, is totally sequential. From that assumption, it follows that the speedup with N cores (execution time on one core divided by execution time on N cores) is governed by 1/[(1 - P) + P/N]. Many learn that equation, but it is still instructive to consider its harsh numerical consequences. For N = 256 cores and a parallel fraction P = 99%, for example, speedup is bounded by 72. Moreover, Gustafson³ made good “weak scaling” arguments for why some software will fare much better. Nevertheless, the committee is skeptical that most future software will avoid sequential bottlenecks. Even such a very parallel approach as MapReduce⁴ has near-sequential activity as the reduce phase draws to a close.
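A minimal sketch of that arithmetic, in Python, appears below; the helper name amdahl_speedup and the particular core counts are chosen only for illustration, and the code simply evaluates the bound 1/[(1 - P) + P/N] quoted above.

    def amdahl_speedup(p, n):
        # Upper bound on speedup when a fraction p of the work is
        # perfectly parallelizable across n cores and the rest is serial.
        return 1.0 / ((1.0 - p) + p / n)

    # With P = 99% parallel, the bound saturates far below the core count.
    for n in (16, 64, 256, 1024):
        print(f"N = {n:4d}: speedup bound = {amdahl_speedup(0.99, n):.1f}")
    # N = 256 yields roughly 72, the figure quoted above; as N grows without
    # bound, the speedup approaches 1/(1 - P) = 100, however many cores exist.

The limiting value 1/(1 - P) is the heart of the concern: even a 1% sequential fraction caps the benefit of hundreds of cores.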
For those reasons, it is prudent to continue work on faster sequential cores,