5.1 Introduction
1. Larger block size to reduce miss rate—The simplest way to reduce the miss rate is to take advantage of spatial locality and increase the block size. Larger blocks reduce compulsory misses, but they also increase the miss penalty (the first sketch after this list makes the tradeoff concrete).
2. Bigger caches to reduce miss rate—The obvious way to reduce capacity misses is to increase cache capacity. Drawbacks include a potentially longer hit time for the larger cache memory, as well as higher cost and power.
3. Higher associativity to reduce miss rate—Obviously, increasing associativity reduces conflict misses. Greater associativity can come at the cost of increased hit time.
4. Multilevel caches to reduce miss penalty—A difficult decision is whether to make the cache hit time fast, to keep pace with the increasing clock rate of processors, or to make the cache large, to overcome the widening gap between the processor and main memory. Adding another level of cache between the original cache and memory simplifies the decision (see Figure 5.3). The first-level cache can be small enough to match a fast clock cycle time, yet the second-level cache can be large enough to capture many accesses that would go to main memory. The focus on misses in second-level caches leads to larger blocks, bigger capacity, and higher associativity. If L1 and L2 refer, respectively, to first- and second-level caches, we can redefine the average memory access time (a worked example follows this list):
Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
5. Giving priority to read misses over writes to reduce miss penalty—A write buffer is a good place to implement this optimization. Write buffers create hazards because they hold the updated value of a location needed on a read miss—that is, a read-after-write hazard through memory. One solution is to check the contents of the write buffer on a read miss. If there are no conflicts, and if the memory system is available, sending the read before the writes reduces the miss penalty. Most processors give reads priority over writes. (A sketch of this check appears after the list.)
6. Avoiding address translation during indexing of the cache to reduce hit time—Caches must cope with the translation of a virtual address from the processor to a physical address to access memory. (Virtual memory is covered in Sections 5.4 and C.4.) Figure 5.3 shows a typical relationship between caches, translation lookaside buffers (TLBs), and virtual memory. A common optimization is to use the page offset—the part that is identical in both virtual and physical addresses—to index the cache. The virtual part of the address is translated while the cache is read using that index, so the tag match can use physical addresses. This scheme allows the cache read to begin immediately, and yet the tag comparison still uses physical addresses. The drawback of this virtually indexed, physically tagged optimization is that the size of the page limits the size of the cache. For example, a direct-mapped cache can be no bigger than the page size. Higher associativity can keep the cache index in the physical part of the address and yet still support a cache larger than a page. (The sizing arithmetic is sketched after the list.)
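To make the block-size tradeoff in optimization 1 concrete, the following C sketch computes average memory access time for a few block sizes under a simple memory model (a fixed latency plus per-byte transfer time). The miss rates and timing parameters are illustrative assumptions, not figures from the text.

```c
#include <stdio.h>

/* Illustrative sketch of the block-size tradeoff. The miss rates and the
 * memory model (fixed latency plus transfer time) are assumptions. */
int main(void) {
    unsigned block_sizes[] = { 16, 32, 64, 128 };
    double   miss_rates[]  = { 0.070, 0.050, 0.040, 0.038 }; /* assumed */
    double   latency  = 80.0;  /* cycles until the first word arrives */
    double   per_byte = 0.25;  /* cycles per byte transferred */

    for (int i = 0; i < 4; i++) {
        double miss_penalty = latency + per_byte * block_sizes[i];
        double amat = 1.0 + miss_rates[i] * miss_penalty; /* 1-cycle hit */
        printf("%3u-byte blocks: penalty %.0f cycles, AMAT %.2f\n",
               block_sizes[i], miss_penalty, amat);
    }
    return 0;
}
```

With these assumed numbers, AMAT falls as blocks grow from 16 to 64 bytes and then rises at 128 bytes: the growing miss penalty eventually outweighs the shrinking miss rate.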
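The two-level formula in optimization 4 can be evaluated directly. The sketch below plugs in illustrative latencies and miss rates; the specific numbers are assumptions, not data from the text.

```c
#include <stdio.h>

/* Two-level average memory access time (AMAT), per the formula above.
 * All numbers below are illustrative assumptions. */
int main(void) {
    double hit_time_l1     = 1.0;   /* clock cycles for an L1 hit */
    double miss_rate_l1    = 0.05;  /* 5% of accesses miss in L1 */
    double hit_time_l2     = 10.0;  /* cycles for an L2 hit */
    double miss_rate_l2    = 0.20;  /* 20% of L1 misses also miss in L2 */
    double miss_penalty_l2 = 100.0; /* cycles to reach main memory */

    double amat = hit_time_l1
                + miss_rate_l1 * (hit_time_l2 + miss_rate_l2 * miss_penalty_l2);

    printf("AMAT = %.2f cycles\n", amat); /* 1 + 0.05 * (10 + 0.2 * 100) = 2.50 */
    return 0;
}
```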
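Optimization 5 amounts to an associative search of the write buffer on every read miss. The following C sketch is a minimal illustration of that check; the buffer size, block size, and names are hypothetical, not any real processor's design.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical write-buffer sketch; entry count and block size are assumed. */
#define WB_ENTRIES 8

typedef struct {
    bool     valid;
    uint64_t addr;      /* block-aligned address of the pending write */
    uint8_t  data[64];  /* buffered block contents */
} wb_entry_t;

static wb_entry_t write_buffer[WB_ENTRIES];

/* On a read miss, search the write buffer first (the read-after-write
 * hazard through memory). If the block is pending there, forward it;
 * otherwise the read may safely be sent ahead of the buffered writes. */
bool read_miss_check(uint64_t block_addr, uint8_t out[64]) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == block_addr) {
            memcpy(out, write_buffer[i].data, 64); /* forward pending data */
            return true;  /* serviced from the buffer, no memory read */
        }
    }
    return false; /* no conflict: issue the read before the writes */
}
```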
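For optimization 6, the constraint is that the cache index and block offset must come entirely from the untranslated page offset. The sketch below works through the resulting capacity limit for a hypothetical 4 KiB page and 64-byte blocks; both parameters are assumptions for illustration.

```c
#include <stdio.h>

/* Capacity limit of a virtually indexed, physically tagged cache.
 * Page and block sizes are hypothetical. */
int main(void) {
    unsigned page_size  = 4096; /* 4 KiB page: offset bits are untranslated */
    unsigned block_size = 64;   /* bytes per cache block */

    for (unsigned assoc = 1; assoc <= 8; assoc *= 2) {
        /* Index + block-offset bits must fit in the page offset, so each
         * way can hold at most one page's worth of data. Total capacity
         * therefore grows only with associativity. */
        unsigned max_capacity = page_size * assoc;
        unsigned sets = max_capacity / (block_size * assoc);
        printf("%u-way: up to %u KiB (%u sets)\n",
               assoc, max_capacity / 1024, sets);
    }
    return 0;
}
```

Note that the number of sets stays fixed at page_size/block_size; only adding ways grows capacity, which is why higher associativity can support a cache larger than a page.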