3.3 Limitations on ILP for Realizable Processors ■ 167
yield. The data in these figures are likely to be very optimistic for another reason.
There are no issue restrictions among the 64 instructions: They may all be mem-
ory references. No one would even contemplate this capability in a processor in
the near future. Unfortunately, it is quite difficult to bound the performance of a
processor with reasonable issue restrictions; not only is the space of possibilities
quite large, but the existence of issue restrictions requires that the parallelism be
evaluated with an accurate instruction scheduler, making the cost of studying
processors with large issue widths very expensive.
In addition, remember that in interpreting these results, cache misses and
nonunit latencies have not been taken into account, and both of these effects
will have a significant impact!
The most startling observation from Figure 3.7 is that with the realistic pro-
cessor constraints listed above, the effect of the window size for the integer pro-
grams is not as severe as for FP programs. This result points to the key difference
between these two types of programs. The availability of loop-level parallelism in
two of the FP programs means that the amount of ILP that can be exploited is
higher, but that for integer programs other factors—such as branch prediction,
register renaming, and less parallelism to start with—are all important limita-
tions. This observation is critical because of the increased emphasis on integer
performance in the last few years. Indeed, most of the market growth in the last
decade—transaction processing, web servers, and the like—depended on integer
performance, rather than floating point. As we will see in the next section, for a
realistic processor in 2005, the actual performance levels are much lower than
those shown in Figure 3.7.
Given the difficulty of increasing the instruction rates with realistic hardware
designs, designers face a challenge in deciding how best to use the limited
resources available on an integrated circuit. One of the most interesting trade-offs
is between simpler processors with larger caches and higher clock rates versus
more emphasis on instruction-level parallelism with a slower clock and smaller
caches. The following example illustrates the challenges.
Example Consider the following three hypothetical, but not atypical, processors, which we
run with the SPEC gcc benchmark:
1. A simple MIPS two-issue static pipe running at a clock rate of 4 GHz and
achieving a pipeline CPI of 0.8. This processor has a cache system that yields
0.005 misses per instruction.
2. A deeply pipelined version of a two-issue MIPS processor with slightly
smaller caches and a 5 GHz clock rate. The pipeline CPI of the processor is
1.0, and the smaller caches yield 0.0055 misses per instruction on average.
3. A speculative superscalar with a 64-entry window. It achieves one-half of the
ideal issue rate measured for this window size. (Use the data in Figure 3.7.)
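For the first two processors, the comparison reduces to folding memory stalls into the pipeline CPI: effective CPI = pipeline CPI + (misses per instruction × miss penalty in cycles), and instruction throughput is the clock rate divided by that effective CPI. The sketch below works this arithmetic for processors 1 and 2 using the figures given above; the 50 ns memory access time used as the miss penalty is an assumed value for illustration, not a number stated in this excerpt, and note that a fixed miss time in nanoseconds costs more cycles at the higher clock rate.

```python
# Sketch of the effective-CPI comparison for the first two hypothetical
# processors in the example. MISS_TIME_NS is an ASSUMED miss penalty,
# chosen only to make the arithmetic concrete.
MISS_TIME_NS = 50.0  # assumed memory access time per miss, in nanoseconds

def instr_rate_gips(clock_ghz, pipeline_cpi, misses_per_instr):
    """Instruction throughput in billions of instructions per second."""
    # A fixed miss time in ns translates to more stall cycles at a faster clock.
    miss_penalty_cycles = MISS_TIME_NS * clock_ghz   # ns x (cycles/ns)
    effective_cpi = pipeline_cpi + misses_per_instr * miss_penalty_cycles
    return clock_ghz / effective_cpi

# Processor 1: two-issue static pipe, 4 GHz, pipeline CPI 0.8,
# 0.005 misses per instruction -> effective CPI = 0.8 + 0.005*200 = 1.8
p1 = instr_rate_gips(4.0, 0.8, 0.005)    # approx. 2.22 GIPS

# Processor 2: deeply pipelined, 5 GHz, pipeline CPI 1.0,
# 0.0055 misses per instruction -> effective CPI = 1.0 + 0.0055*250 = 2.375
p2 = instr_rate_gips(5.0, 1.0, 0.0055)   # approx. 2.11 GIPS
```

Under this assumed miss penalty, the higher clock rate of processor 2 is more than offset by its deeper pipeline and slightly worse miss rate, illustrating why a faster clock alone need not win.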