advantage of cache lines is that the latency penalty of a cache miss occurs only on
the first miss on an item belonging to a line. The line is fetched from memory as a
whole; neighboring items can then be loaded from cache with much lower latency,
increasing the cache hit ratio γ, not to be confused with the reuse ratio β. So if the
application shows some spatial locality, i.e., if the probability of successive accesses
to neighboring items is high, the latency problem can be significantly reduced. The
downside of cache lines is that erratic data access patterns are not supported. On the
contrary, not only does each load incur a miss and subsequent latency penalty, it also
leads to the transfer of a whole cache line, polluting the memory bus with data that
will probably never be used. The effective bandwidth available to the application will
thus be very low. On the whole, however, the advantages of using cache lines pre-
vail, and very few processor manufacturers have provided means of bypassing the
mechanism.
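
As a minimal sketch (not taken from the text) of what spatial locality means in practice, consider the two loops below. Assuming a hypothetical cache line of 8 DP words, the contiguous loop misses only once per line, while the strided loop misses on every access yet still drags a whole line across the memory bus each time, which is exactly the kind of bus pollution described above. The array size and stride are illustrative only.

#include <stdio.h>
#include <stdlib.h>

#define N (1L << 24)                 /* illustrative array size */

/* Spatial locality: neighboring items of each cache line are reused,
 * so only roughly one access per line suffers the full memory latency. */
double sum_contiguous(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Erratic/strided access: with a stride spanning a whole (assumed)
 * 8-word cache line, every access misses, and each miss transfers a
 * full line of which only one word is used. */
double sum_strided(const double *a, long n, long stride) {
    double s = 0.0;
    for (long i = 0; i < n; i += stride)
        s += a[i];
    return s;
}

int main(void) {
    double *a = malloc(N * sizeof *a);
    for (long i = 0; i < N; ++i) a[i] = 1.0;
    printf("%f %f\n", sum_contiguous(a, N), sum_strided(a, N, 8));
    free(a);
    return 0;
}

Timing the two functions on a real machine shows the effective bandwidth of the strided version dropping to roughly one eighth of the contiguous one, in line with the argument above.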
Assuming a streaming application working on DP floating point data on a CPU
with a cache line length of L_c = 16 words, spatial locality fixes the hit ratio at
γ = (16 − 1)/16 ≈ 0.94, a seemingly large value. Still, it is clear that performance is
governed by main memory bandwidth and latency — the code is memory-bound. In
order for an application to be truly cache-bound, i.e., decouple from main memory so
that performance is not governed by memory bandwidth or latency any more, γ must
be large enough so the time it takes to process in-cache data becomes larger than the
time for reloading it. If and when this happens depends of course on the details of
the operations performed.
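
A rough back-of-the-envelope check (not from the text; the symbols T_cache and T_mem and the latency values of 4 and 200 cycles are purely illustrative assumptions) makes the point quantitatively. With a simple average access time model,

\[
  T_{\text{avg}} \;=\; \gamma\,T_{\text{cache}} + (1-\gamma)\,T_{\text{mem}}
  \;\approx\; 0.94 \cdot 4 + 0.06 \cdot 200 \;\approx\; 16 \ \text{cycles},
\]

so despite γ ≈ 0.94 the miss term dominates, and the code stays memory-bound unless γ gets much closer to one or the in-cache work per data item grows substantially.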
By now we can qualitatively interpret the performance data for cache-based ar-
chitectures on the vector triad in Figure 1.4. At very small loop lengths, the processor
pipeline is too long to be efficient. With growing N this effect becomes negligible,
and as long as all four arrays fit into the innermost cache, performance saturates at
a high value that is set by the L1 cache bandwidth and the ability of the CPU to is-
sue load and store instructions. Increasing N a little more gives rise to a sharp drop
in performance because the innermost cache is not large enough to hold all data.
Second-level cache usually has larger latency but similar bandwidth to L1, so that
the penalty is smaller than expected. However, streaming data from L2 has the disad-
vantage that L1 now has to provide data for registers as well as continuously reload
and evict cache lines from/to L2, which puts a strain on the L1 cache’s bandwidth
limits. Since the ability of caches to deliver data to higher and lower hierarchy levels
concurrently is highly architecture-dependent, performance is usually hard to predict
on all but the innermost cache level and main memory. For each cache level another
performance drop is observed with rising N, until finally even the large outer cache is
too small and all data has to be streamed from main memory. The sizes of the different
caches are directly related to the locations of the bandwidth breakdowns. Section 3.1
will describe how to predict performance for simple loops from basic parameters like
cache or memory bandwidths and the data demands of the application.
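
For concreteness, the following is a sketch of the kind of benchmark loop behind Figure 1.4, assuming the vector triad has its usual form A(i) = B(i) + C(i)*D(i) over four DP arrays; the repetition counts, array sizes, and timing scaffolding are illustrative only, not the book's benchmark code. Running it for growing N traces out the cache plateaus and the successive performance drops described above.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Vector triad sketch: the inner loop is repeated niter times so that,
 * for small N, the working set stays cached after the first sweep; the
 * reported MFlops/s then reflect whichever memory level holds the arrays. */
static double triad_seconds(double *a, const double *b, const double *c,
                            const double *d, long n, long niter) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long it = 0; it < niter; ++it) {
        for (long i = 0; i < n; ++i)
            a[i] = b[i] + c[i] * d[i];   /* 2 flops, 3 loads, 1 store per i */
        if (a[n / 2] < 0.0)              /* never true; keeps the compiler  */
            printf("impossible\n");      /* from eliminating the loop       */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    for (long n = 1024; n <= (1L << 24); n *= 2) {
        long niter = (1L << 27) / n;     /* keep total work roughly constant */
        double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c), *d = malloc(n * sizeof *d);
        for (long i = 0; i < n; ++i) { a[i] = 0.0; b[i] = c[i] = d[i] = 1.0; }
        double t = triad_seconds(a, b, c, d, n, niter);
        printf("N = %9ld   MFlops/s = %.1f\n", n, 2.0 * n * niter / t / 1e6);
        free(a); free(b); free(c); free(d);
    }
    return 0;
}

Plotting MFlops/s against N from such a run reproduces, qualitatively, the staircase of plateaus and breakdowns discussed for Figure 1.4.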
Storing data is a little more involved than reading. In the presence of caches, if data
to be written out already resides in cache, a write hit occurs. There are several pos-
sibilities for handling this case, but usually outermost caches work with a write-back
strategy: The cache line is modified in cache and written to memory as a whole when