8 Introduction to High Performance Computing for Scientists and Engineers
2. Superscalar architecture. Superscalarity provides “direct” instruction-level
parallelism by enabling an instruction throughput of more than one per cycle.
This requires multiple, possibly identical functional units, which can operate
concurrently (see Section 1.2.4 for details). Modern microprocessors are up to
six-way superscalar.
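As a small illustration (the function and variable names are ours, not the text's), the two accumulations in the following loop body have no data dependence on each other, so a two-way superscalar core can issue both additions in the same cycle:

```c
#include <assert.h>

/* Illustrative sketch: the two statements in the loop body are
   independent, so a superscalar processor with two floating-point
   add units can execute them in parallel within one cycle. */
double sum_two(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0;  /* independent accumulators */
    for (int i = 0; i < n; i++) {
        s0 += a[i];             /* no data dependence between */
        s1 += b[i];             /* these two additions        */
    }
    return s0 + s1;
}
```

A single accumulator would serialize the additions through its loop-carried dependence; splitting it exposes the instruction-level parallelism the hardware can exploit.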
3. Data parallelism through SIMD instructions. SIMD (Single Instruction Multi-
ple Data) instructions issue identical operations on a whole array of integer or
FP operands, usually in special registers. They improve arithmetic peak per-
formance without the requirement for increased superscalarity. Examples are
Intel’s “SSE” and its successors, AMD’s “3dNow!,” the “AltiVec” extensions
in Power and PowerPC processors, and the “VIS” instruction set in Sun’s Ul-
traSPARC designs. See Section 1.2.5 for details.
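As a sketch of how SIMD is typically reached from C (the function name is illustrative): with the `restrict` qualifiers ruling out aliasing, a compiler can map the stride-1 loop below to SIMD instructions, e.g., one SSE2 `addpd` instruction adding two double-precision operands at once.

```c
#include <assert.h>

/* Hypothetical sketch: an aliasing-free, stride-1 loop that a
   vectorizing compiler can translate into SIMD instructions,
   processing several array elements per instruction. */
void vadd(double * restrict c, const double * restrict a,
          const double * restrict b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* unit stride: vectorizable */
}
```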
4. Out-of-order execution. If arguments to instructions are not available in regis-
ters “on time,” e.g., because the memory subsystem is too slow to keep up with
processor speed, out-of-order execution can avoid idle times (also called stalls)
by executing instructions that appear later in the instruction stream but have
their parameters available. This improves instruction throughput and makes it
easier for compilers to arrange machine code for optimal performance. Cur-
rent out-of-order designs can keep hundreds of instructions in flight at any
time, using a reorder buffer that stores instructions until they become eligible
for execution.
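The following sketch (names are ours) shows the kind of code sequence that benefits: the load may miss the cache and stall for many cycles, but the multiply does not depend on it, so an out-of-order core executes the multiply while the load is still in flight.

```c
#include <assert.h>

/* Illustrative sketch: an out-of-order core can execute the
   independent multiply while the (possibly long-latency) load
   is still pending, hiding part of the memory latency. */
double hide_latency(const double *a, double x, double y)
{
    double t = a[0];   /* load: may miss the cache, long latency  */
    double u = x * y;  /* independent: can run "past" the load    */
    return t + u;      /* needs both operands: waits for the load */
}
```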
5. Larger caches. Small, fast, on-chip memories serve as temporary data storage
for holding copies of data that is to be used again “soon,” or that is close to
data that has recently been used. This is essential due to the increasing gap
between processor and memory speeds (see Section 1.3). Enlarging the cache
usually does not hurt application performance, but there is some tradeoff
because a big cache tends to be slower than a small one.
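A sketch of why access order matters for caches (the function name is ours): the row-major traversal below walks consecutive addresses, so every element of each cache line fetched from memory is used before the line is evicted. Exchanging the two loops would jump a whole row between consecutive accesses and waste most of each line.

```c
#include <assert.h>

/* Illustrative sketch: stride-1 (row-major) traversal of a C array
   exploits spatial locality, using every element of each cache line. */
double sum_rowmajor(int n, double a[n][n])
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += a[i][j];   /* consecutive addresses: cache-friendly */
    return s;
}
```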
6. Simplified instruction set. In the 1980s, a general move from the CISC to the
RISC paradigm took place. In a CISC (Complex Instruction Set Computer),
a processor executes very complex, powerful instructions, requiring a large
hardware effort for decoding but keeping programs small and compact. This
lightened the burden on programmers, and saved memory, which was a scarce
resource for a long time. A RISC (Reduced Instruction Set Computer) features
a very simple instruction set that can be executed very rapidly (few clock cycles
per instruction; in the extreme case each instruction takes only a single cycle).
With RISC, the clock rate of microprocessors could be increased in a way that
would never have been possible with CISC. Additionally, it frees up transistors
for other uses. Nowadays, most computer architectures significant for scientific
computing use RISC at the low level. Although x86-based processors execute
CISC machine code, they perform an internal on-the-fly translation into RISC
“µ-ops.”
In spite of all innovations, processor vendors have recently been facing high
obstacles in pushing the performance limits of monolithic, single-core CPUs to
new levels.