dimension, because one of the matrices will be accessed in an order that
allows little reuse of data in the cache. However, if we decompose the
problem into smaller matrix multiplication problems, we can capture
locality, reusing each word fetched from memory many times.
Suppose we have a memory capable of holding 256 kB (32 kW) 1
mm from our floating-point unit. The local memory is large enough to
hold three 100 × 100 submatrices, one for each input operand and one
for the partial result. We can perform a 100 × 100 matrix multiplication
entirely out of the local memory, performing 2 × 10⁶ operations with only
4 × 10⁴ memory references—a ratio of 50 operations per reference. We can
apply this blocking recursively. If there is aggregate on-chip memory of
32 MB (4 MW), we can hold three 1,000 × 1,000 submatrices at this level
of the storage hierarchy. In a seminal paper, Hong and Kung proved that
this idea is optimal for matrix multiplication, in the sense that such a
blocked algorithm moves the minimum possible amount of data between the
processor and the memory system.⁹
Other array computations,
including convolutions and fast Fourier transformations, can be blocked
in this manner—although with different computation-to-communication
ratios—and there are theoretical results on the optimality of communi-
cation for several linear algebra problems for both parallel and serial
machines.
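
To make the arithmetic concrete, here is a minimal C sketch of such a
blocked multiplication (an illustration, not the report's code), assuming
n × n row-major matrices of doubles; the tile size mirrors the 100 × 100
submatrices of the example and would in practice be tuned to the actual
local-memory capacity, with n assumed to be a multiple of the tile size
to keep the sketch short:

```c
#include <stddef.h>

/* Blocked (tiled) matrix multiplication: C += A * B for n x n
 * row-major matrices of doubles.  Each BSIZE x BSIZE tile of A, B,
 * and C stays resident in fast memory while it is reused, giving
 * the high operations-per-reference ratio described above: three
 * 100 x 100 tiles are fetched (3 x 10^4 words, plus 10^4 stores)
 * and reused across 2 x 10^6 multiply-add operations. */
#define BSIZE 100

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BSIZE)
        for (size_t jj = 0; jj < n; jj += BSIZE)
            for (size_t kk = 0; kk < n; kk += BSIZE)
                /* One tile-sized subproblem, computed entirely out
                 * of the local memory. */
                for (size_t i = ii; i < ii + BSIZE; i++)
                    for (size_t k = kk; k < kk + BSIZE; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BSIZE; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Hong and Kung's lower bound says that any order of computation must move
on the order of n³/√M words between memory and a fast store of capacity
M, which is exactly what blocking with tiles of roughly √M × √M achieves.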
The recursive nature of the blocked algorithm also led to the notion
of “cache-oblivious” algorithms, in which the recursive subdivision pro-
duces successively smaller subproblems that eventually fit into a cache or
other fast memory layer.¹⁰
Whereas other blocked algorithms are implemented to match the size of a
particular cache, the oblivious algorithms are optimized for locality
without having machine-specific constants, such as the cache size, in
their implementation.
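
As an illustration of the oblivious style, the following recursive C
sketch (again an assumption-laden illustration, with the dimension taken
to be a power of two) halves the problem until a small base case is
reached; no cache-size constant appears anywhere, yet every level of the
memory hierarchy is eventually matched by some level of the recursion:

```c
#include <stddef.h>

/* Cache-oblivious matrix multiply: C += A * B, where A, B, and C are
 * n x n submatrices embedded in a larger row-major array with row
 * stride `stride`.  The recursion keeps splitting the problem until
 * the subproblem fits in whatever cache level is present.  For
 * brevity, n is assumed to be a power of two. */
static void matmul_co(size_t n, size_t stride,
                      const double *A, const double *B, double *C)
{
    if (n <= 16) {                      /* small base case: plain loops */
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    C[i * stride + j] += A[i * stride + k] * B[k * stride + j];
        return;
    }
    size_t h = n / 2;                   /* split each matrix into quadrants */
    const double *A00 = A,              *A01 = A + h,
                 *A10 = A + h * stride, *A11 = A + h * stride + h;
    const double *B00 = B,              *B01 = B + h,
                 *B10 = B + h * stride, *B11 = B + h * stride + h;
    double *C00 = C,                    *C01 = C + h,
           *C10 = C + h * stride,       *C11 = C + h * stride + h;

    matmul_co(h, stride, A00, B00, C00);  /* C00 += A00*B00 + A01*B10 */
    matmul_co(h, stride, A01, B10, C00);
    matmul_co(h, stride, A00, B01, C01);  /* C01 += A00*B01 + A01*B11 */
    matmul_co(h, stride, A01, B11, C01);
    matmul_co(h, stride, A10, B00, C10);  /* C10 += A10*B00 + A11*B10 */
    matmul_co(h, stride, A11, B10, C10);
    matmul_co(h, stride, A10, B01, C11);  /* C11 += A10*B01 + A11*B11 */
    matmul_co(h, stride, A11, B11, C11);
}
```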
Locality optimizations for irregular codes, such as graph algorithms, can
be much more difficult because the data structures are built with pointers
or indexed structures that lead to random memory accesses. Even so, some
graph algorithms have considerable locality that can be realized by
partitioning the graph into subgraphs that fit into a local memory and
reorganizing the computation to operate on each subgraph, with reuse,
before moving on to the next subgraph. There are many algorithms and
software libraries for performing graph partitioning.
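
A hypothetical C sketch of the subgraph approach follows. It assumes the
vertices have already been renumbered by a partitioner (for example, a
library such as METIS) so that each partition occupies a contiguous range
of vertex numbers; the Graph type and relax_by_partition function are
illustrative names, not from any particular library:

```c
#include <stddef.h>

/* Graph in compressed-sparse-row (CSR) form.  Vertices are assumed
 * to be renumbered so that each partition's vertices and most of
 * their neighbors are contiguous and fit in local memory. */
typedef struct {
    size_t  nvert;        /* number of vertices                        */
    size_t *row_start;    /* CSR row offsets, length nvert + 1         */
    size_t *col;          /* neighbor indices, length row_start[nvert] */
} Graph;

/* One relaxation step of a simple iterative computation (here, a
 * neighbor sum), processed one cache-sized subgraph at a time so
 * that each partition's data are reused before moving on. */
void relax_by_partition(const Graph *g,
                        const size_t *part_start, size_t nparts,
                        const double *val_in, double *val_out)
{
    for (size_t p = 0; p < nparts; p++) {
        /* Vertices in [part_start[p], part_start[p+1]) and most of
         * their neighbors live in one subgraph, so these accesses
         * hit fast memory repeatedly instead of striding randomly
         * across the whole graph. */
        for (size_t v = part_start[p]; v < part_start[p + 1]; v++) {
            double sum = 0.0;
            for (size_t e = g->row_start[v]; e < g->row_start[v + 1]; e++)
                sum += val_in[g->col[e]];
            val_out[v] = sum;
        }
    }
}
```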
⁹ See Hong Jia-Wei and H.T. Kung, 1981, I/O complexity: The red-blue
pebble game, Proceedings of the Thirteenth Annual ACM Symposium on Theory
of Computing, Milwaukee, Wis., May 11-13, 1981, pp. 326-333.
¹⁰ Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar
Ramachandran, 1999, Cache-oblivious algorithms, Proceedings of the 40th
IEEE Symposium on Foundations of Computer Science, New York, N.Y.,
October 17-19, 1999, pp. 285-297.