5.6 Putting It All Together: AMD Opteron Memory Hierarchy ■ 329
physical page frame from the instruction TLB (step 6). Since the Opteron fetches
16 bytes of instructions at a time, an additional 2 bits of the 6-bit block
offset are used to select the appropriate 16 bytes. Hence, 9 + 2 or 11 bits are used to send
16 bytes of instructions to the processor. The L1 cache is pipelined, and the
latency of a hit is 2 clock cycles. A miss goes to the second-level cache and to the
memory controller, to lower the miss penalty in case the L2 cache misses.
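The address split described above is simple bit slicing. The sketch below uses the field widths given in the text (a 6-bit block offset, a 9-bit set index, 16-byte fetch groups); the function and field names are ours, not AMD's:

```python
# Sketch of the L1 instruction cache address split described above.
# Constants come from the text: 64-byte blocks (6 offset bits), a 9-bit
# set index, and 16-byte instruction fetch groups.

BLOCK_OFFSET_BITS = 6   # 64-byte blocks
INDEX_BITS = 9          # 2^9 = 512 sets
FETCH_GROUP_BITS = 4    # 16-byte fetch groups within a block

def l1i_fields(vaddr: int) -> dict:
    """Split a virtual address into the fields the L1 I-cache uses."""
    block_offset = vaddr & ((1 << BLOCK_OFFSET_BITS) - 1)
    index = (vaddr >> BLOCK_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    # The top 2 bits of the 6-bit block offset select which 16-byte
    # group of the 64-byte block is sent to the processor (9 + 2 = 11 bits).
    fetch_group = block_offset >> FETCH_GROUP_BITS
    return {"index": index, "fetch_group": fetch_group}
```

For example, an address whose block offset is 0x30 selects fetch group 3, the last 16 bytes of its 64-byte block.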
As mentioned earlier, the instruction cache is virtually addressed and physi-
cally tagged. On a miss, the cache controller must check for a synonym (two dif-
ferent virtual addresses that reference the same physical address). Hence, the
instruction cache tags are examined for synonyms in parallel with the L2 cache
tags during an L2 lookup. As the minimum page size is 4 KB or 12 bits and the
cache index plus block offset is 15 bits, the cache must check 2^3 or 8 blocks per
way for synonyms. Opteron uses the redundant snooping tags to check all syn-
onyms in 1 clock cycle. If it finds a synonym, the offending block is invalidated
and refetched from memory. This guarantees that a cache block can reside in only
one of the 16 possible data cache locations at any given time.
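The synonym count above follows from a few lines of arithmetic. The constants below are the ones given in the text (4 KB minimum pages, 15 bits of index plus offset, a 2-way set associative L1 cache); the function name is ours:

```python
# Why the Opteron must check 8 blocks per way (16 locations total)
# for synonyms, using the parameters stated in the text.

PAGE_OFFSET_BITS = 12         # 4 KB minimum page size
INDEX_PLUS_OFFSET_BITS = 15   # 9-bit cache index + 6-bit block offset
WAYS = 2                      # 2-way set associative L1 cache

def synonym_locations() -> tuple:
    # Bits above the page offset but inside the cache index are taken
    # from the virtual address, so they can differ between two virtual
    # aliases of the same physical block.
    alias_bits = INDEX_PLUS_OFFSET_BITS - PAGE_OFFSET_BITS
    blocks_per_way = 1 << alias_bits          # 2^3 = 8
    return blocks_per_way, blocks_per_way * WAYS
```

With these parameters the function returns (8, 16): 8 candidate blocks per way, and 16 possible cache locations in all.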
The second-level cache tries to fetch the block on a miss. The L2 cache is
1 MB, 16-way set associative with 64-byte blocks. It uses a pseudo-LRU replacement
scheme: the 16 ways are managed as eight pairs of blocks under LRU, and on a
replacement one block of the least recently used pair is picked at random. The L2 index is

    Index = Cache size / (Block size × Set associativity) = 1024K / (64 × 16) = 1024 = 2^10

so the 34-bit block address (40-bit physical address – 6-bit block offset) is
divided into a 24-bit tag and a 10-bit index (step 8). Once again, the index and tag
are sent to all 16 groups of the 16-way set associative L2 cache (step 9), which
are compared in parallel. If one matches and is valid (step 10), it returns the block
in sequential order, 8 bytes per clock cycle. The L2 cache also cancels the mem-
ory request that the L1 cache sent to the controller. An L1 instruction cache miss
that hits in the L2 cache costs 7 processor clock cycles for the first word.
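The tag/index split in step 8 is again bit slicing. The field widths below come from the text (40-bit physical address, 6-bit offset, 10-bit index, 24-bit tag); the function name is ours:

```python
# Sketch of the L2 physical address split from step 8: a 40-bit physical
# address minus the 6-bit block offset leaves a 34-bit block address,
# divided into a 24-bit tag and a 10-bit index.

OFFSET_BITS = 6    # 64-byte blocks
INDEX_BITS = 10    # 1024 sets
TAG_BITS = 24      # 40 - 10 - 6

def l2_fields(paddr: int) -> tuple:
    """Return (tag, index, offset) for a 40-bit physical address."""
    offset = paddr & ((1 << OFFSET_BITS) - 1)
    index = (paddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```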
The Opteron has an exclusion policy between the L1 caches and the L2 cache
to try to better utilize the resources, which means a block is in L1 or L2 caches
but not in both. Hence, it does not simply place a copy of the block in the L2
cache. Instead, the only copy of the new block is placed in the L1 cache. The old
L1 block is sent to the L2 cache. If a block knocked out of the L2 cache is dirty, it
is sent to the write buffer, called the victim buffer in the Opteron.
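The exclusive fill and victim movement described above can be modeled with a toy hierarchy. The class and method names are illustrative, not AMD's, and true per-set replacement is simplified to LRU lists; the point is that a block moves (rather than copies) between levels, so it never resides in both:

```python
# Toy model of the exclusive L1/L2 policy described above. On an L1 miss
# that hits in L2, the block moves from L2 to L1; the old L1 block moves
# down to L2; a dirty block leaving L2 goes to the victim buffer.

class ExclusiveHierarchy:
    def __init__(self, l1_capacity: int, l2_capacity: int):
        self.l1, self.l2 = [], []              # block addresses, LRU order
        self.l1_cap, self.l2_cap = l1_capacity, l2_capacity
        self.victim_buffer = []                # blocks evicted from L2

    def access(self, block: int) -> str:
        if block in self.l1:
            return "L1 hit"
        if block in self.l2:
            self.l2.remove(block)              # exclusion: L2 gives up its copy
            result = "L2 hit"
        else:
            result = "miss"                    # fetched from memory straight to L1
        self.l1.append(block)                  # only copy now lives in L1
        if len(self.l1) > self.l1_cap:
            self.l2.append(self.l1.pop(0))     # old L1 block moves down to L2
            if len(self.l2) > self.l2_cap:
                self.victim_buffer.append(self.l2.pop(0))
        return result
```

After any sequence of accesses, the L1 and L2 contents are disjoint, which is the invariant the exclusion policy maintains.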
In the last chapter, we showed how inclusion allows all coherency traffic to
affect only the L2 cache and not the L1 caches. Exclusion means coherency traf-
fic must check both. To reduce interference between coherency traffic and the
processor for the L1 caches, the Opteron has a duplicate set of address tags for
coherency snooping.
If the instruction is not found in the secondary cache, the on-chip memory
controller must get the block from main memory. The Opteron has dual 64-bit
memory channels that can act as one 128-bit channel, since there is only one
memory controller and the same address is sent on both channels (step 11). Wide