dependent instructions are handled by a renaming process that accommodates
dependent renaming in 1 clock. Once instructions are issued, the detection of
dependences is handled in a distributed fashion by the reservation stations or
scoreboard.
The set of instructions that is examined for simultaneous execution is called
the window. Each instruction in the window must be kept in the processor, and
the number of comparisons required every clock is equal to the maximum
completion rate times the window size times the number of operands per
instruction (today up to 6 × 200 × 2 = 2400), since every pending instruction
must look at every completing instruction for either of its operands. Thus, the total window
size is limited by the required storage, the comparisons, and a limited issue rate,
which makes a larger window less helpful. Remember that even though existing
processors allow hundreds of instructions to be in flight, because they cannot
issue and rename more than a handful in any clock cycle, the maximum
throughput is likely to be limited by the issue rate. For example, if the instruction stream
contained totally independent instructions that all hit in the cache, a large window
would simply never fill. The value of having a window larger than the issue rate
occurs when there are dependences or cache misses in the instruction stream.
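As a rough sketch of the arithmetic above, the comparison count is simply the
product of the maximum completion rate, the window size, and the number of
operands per instruction. The figures below are the illustrative values quoted
in the text, not parameters of any particular processor:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative figures from the text, not a specific processor. */
        int completion_rate = 6;    /* max instructions completing per clock */
        int window_size     = 200;  /* instructions examined for execution   */
        int operands        = 2;    /* source operands per instruction       */

        /* Every pending instruction compares each of its source operands
           against every result produced in a clock, so the comparator
           count is the product of the three terms. */
        int comparisons = completion_rate * window_size * operands;
        printf("comparisons per clock = %d\n", comparisons);   /* 2400 */
        return 0;
    }

The cost of this comparison logic, rather than the storage alone, is what grows
fastest as the window is widened.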
The window size directly limits the number of instructions that begin exe-
cution in a given cycle. In practice, real processors will have a more limited
number of functional units (e.g., no superscalar processor has handled more
than two memory references per clock), as well as limited numbers of buses
and register access ports, which serve as limits on the number of instructions
initiated per clock. Thus, the maximum number of instructions that may issue,
begin execution, or commit in the same clock cycle is usually much smaller
than the window size.
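To make the point concrete, the following minimal sketch uses hypothetical
per-clock limits chosen only for illustration; it shows that sustained
throughput is capped by the tightest structural limit rather than by the
window size:

    #include <stdio.h>

    static int min2(int a, int b) { return a < b ? a : b; }

    int main(void)
    {
        /* Hypothetical per-clock structural limits, chosen only to
           illustrate the point; they do not describe a real processor. */
        int window_size  = 200;   /* instructions buffered in the window       */
        int issue_width  = 6;     /* instructions issued and renamed per clock */
        int commit_width = 6;     /* instructions committed per clock          */

        /* However many instructions are in flight, no more than the
           narrowest per-clock limit can issue, begin execution, or
           commit in a single cycle. */
        int per_clock_limit = min2(window_size, min2(issue_width, commit_width));
        printf("instructions per clock = %d\n", per_clock_limit);  /* 6 */
        return 0;
    }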
Obviously, the number of possible implementation constraints in a multiple-
issue processor is large, including issues per clock, functional units and unit
latency, register file ports, functional unit queues (which may be fewer than
units), issue limits for branches, and limitations on instruction commit. Each of
these acts as a constraint on the ILP. Rather than try to understand each of these
effects, however, we will focus on limiting the size of the window, with the
understanding that all other restrictions would further reduce the amount of paral-
lelism that can be exploited.
Figure 3.2 shows the effects of restricting the size of the window from which
an instruction can execute. As we can see in Figure 3.2, the amount of parallelism
uncovered falls sharply with decreasing window size. In 2005, the most advanced
processors have window sizes in the range of 64–200, but these window sizes are
not strictly comparable to those shown in Figure 3.2 for two reasons. First, many
functional units have multicycle latency, reducing the effective window size com-
pared to the case where all units have single-cycle latency. Second, in real proces-
sors the window must also hold any memory references waiting on a cache miss,
which are not considered in this model, since it assumes a perfect, single-cycle
cache access.