of view, though, memory is not monolithic. It might take more time to bring a(i) from memory into the
processor than it does to bring in b(i), and bringing in a(i) at one point in the program's execution
might take longer than bringing it in at a later point.
Two factors lead to the design of machines with varying memory latencies. The first factor is that it is both
cheaper and more feasible to build faster memory if that memory is small. Multiple technologies exist for
building memory. Faster technologies tend to be more expensive (presumably one can also devise slow
and expensive technology, but no one will use it). That means that for a given budget, we can pay for a
small amount of faster memory and a larger amount of slower memory. We cannot replace the slower
memory with the faster one without raising costs. Even ignoring costs, it is technologically more feasible
to make accesses to a smaller memory faster than accesses to a larger one. Part of the time required to
bring data from memory into a processor is the time required to find the data, and part of the time is the
time required to move the data. Data moves using electrical signals that travel at a finite velocity. That
means that the closer a piece of data is to the processor, the faster it can be brought into the processor.
There is a limit to how much space exists within any given distance from the processor. With
microprocessors, memory that can be put on the same chip as the processor is significantly faster to
access than memory that is farther away on some other chip, and there is a limit to how much memory
can be put on the processor chip.
The fact that we can build smaller amounts of faster memory would not necessarily imply that we should.
Let's say, for example, that slow memory is 10 times less expensive than fast memory, and that we build
a system with 10 times as much slow memory as fast memory. Let's say also that memory references are
distributed randomly between the fast and slow memory. Since there is 10 times more slow memory, 90%
of the references will be to the slow memory. Even if the fast memory is infinitely fast, we will speed up
the program by only about 10% (this percentage can be derived using another variation of Amdahl's law).
The costs, on the other hand, have increased by a factor of two, since the small fast memory costs as
much as all of the slow memory. It is unlikely that this trade-off is worthwhile.
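To make the arithmetic concrete, here is a small C sketch of that calculation. It assumes, as above, that 90% of the references go to the slow memory, that the fast memory is infinitely fast, and, for simplicity, that the program's running time consists entirely of memory accesses.

    #include <stdio.h>

    int main(void) {
        /* With random placement, the fraction of references that land in
           the fast memory equals its fraction of the total capacity. */
        double fast_fraction = 0.10;

        /* Idealization: fast-memory accesses take no time at all, so only
           the slow references contribute to the new running time. */
        double new_time = 1.0 - fast_fraction;   /* 0.9 of the original time */
        double speedup  = 1.0 / new_time;        /* Amdahl-style speedup */

        printf("speedup = %.2fx, i.e., about %.0f%% faster\n",
               speedup, (speedup - 1.0) * 100.0);
        return 0;
    }

Run as written, this reports a speedup of about 1.11x, which is the roughly 10% improvement quoted above.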
The second factor that leads to the design of machines with varying memory latencies, and that makes it
worthwhile to build small but fast memories, is the concept of locality. Locality is the property that if a
program accesses a piece of memory, there is a much higher than random probability that it will again
access the same location "soon." A second aspect of locality is that if a program accesses a piece of
memory, there is a much higher than random probability that it will access a nearby location soon. The
first type of locality is called temporal locality; the second is called spatial. Locality is not a law of nature. It
is perfectly possible to write a program that has completely random or systematically nonlocal memory
behavior. Locality is just an empirical rule. Many programs naturally have locality, and many others will be
written to have locality if the prevalent machines are built to take advantage of locality.
Let us consider some examples that illustrate locality. Let us say that we wish to zero a large array. We
can zero the elements in any order, but the natural way to zero them is in lexical order, that is, first a(0),
then a(1), then a(2), and so on. Zeroing an array this way exhibits spatial locality. Now let's assume that
we want to apply some type of convolution filter to an image (e.g., a simple convolution may update each
element to be a weighted average of its neighboring values in a grid). Each element in the image will be
touched a number of times equal to the size of the filter. If the filter is moved lexically through the image,
all accesses to a given point will occur close in time. This type of algorithm has both spatial and temporal
locality.
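Both access patterns can be sketched in C. The array and image sizes below are placeholders chosen only for illustration, the filter is fixed at 3x3, and boundary handling is omitted to keep the sketch short.

    #define N 1024          /* illustrative array/image dimension */

    /* Zeroing in lexical order: consecutive iterations touch consecutive
       addresses, which is spatial locality. */
    void zero_array(double a[N]) {
        for (int i = 0; i < N; i++)
            a[i] = 0.0;
    }

    /* A simple 3x3 convolution filter moved lexically over the image: each
       element is read once for every filter position that covers it, and
       those reads occur close together in time, so the loop has both
       spatial and temporal locality. */
    void convolve(double img[N][N], double out[N][N], double w[3][3]) {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++) {
                double sum = 0.0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        sum += w[di + 1][dj + 1] * img[i + di][j + dj];
                out[i][j] = sum;
            }
    }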
The prevalence of locality allows us to design a small and fast memory system that can greatly improve
performance. Consider again the case where 10% of the memory is fast and 90% is slow. If accesses are
random and only 10% of the memory accesses are to the fast memory, the cost of the fast memory is
probably too large to be worthwhile. Caches, on the other hand, provide a way to exploit locality to ensure
that a much larger percentage of your memory references are to faster memory.
Caches essentially work as follows. All information is kept in the larger, slower memory. Every time the
system brings a memory reference into the processor, it "caches" a copy of the data in the smaller, faster
memory, the cache. For simplicity, we will use the generic term "memory" to refer to the slower, bigger
memory and "cache" to refer to the faster, smaller memory. Before performing a memory reference, the
system checks to see if the data is inside the cache. If the data is found, the system does not go to
memory; it just retrieves the copy of the data found in the cache. Only when the data is not in the cache
does the system go to memory. Since the cache is smaller than the memory, it cannot contain a copy of
all the data in the memory. In other words, the cache can fill up. In such cases, the cache must remove
some other data in order to make room for the new data.
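As one way to picture this behavior, the following C sketch models a direct-mapped cache, in which each address maps to exactly one cache line. The organization, the capacity, and the function names are illustrative assumptions, not a description of any particular machine.

    #include <stdbool.h>
    #include <stdint.h>

    #define CACHE_LINES 256        /* illustrative capacity; real caches vary */

    /* One cache line: a valid bit, a tag identifying which address the line
       currently holds, and (not shown) the cached copy of the data. */
    struct cache_line {
        bool     valid;
        uint64_t tag;
    };

    static struct cache_line cache[CACHE_LINES];

    /* Returns true on a hit (the copy in the cache is used) and false on a
       miss (the system must go to memory, and the fetched data replaces
       whatever previously occupied the line). */
    bool cache_access(uint64_t address) {
        uint64_t index = address % CACHE_LINES;
        uint64_t tag   = address / CACHE_LINES;

        if (cache[index].valid && cache[index].tag == tag)
            return true;                 /* found in the cache: no memory access */

        cache[index].valid = true;       /* miss: cache a copy, evicting the old entry */
        cache[index].tag   = tag;
        return false;
    }

    int main(void) {
        cache_access(42);    /* first touch: miss, data is brought into the cache */
        cache_access(42);    /* touched again soon after: hit (temporal locality) */
        return 0;
    }

In a direct-mapped cache the choice of which data to remove is forced by the address itself; set-associative and fully associative caches instead choose among several candidate lines using a replacement policy such as least recently used.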