298 Introduction to High Performance Computing for Scientists and Engineers
of the application execute α times faster than on the host, whereas the rest stays
unchanged. Using Amdahl's Law, s is the host part and p = 1 − s is the accelerated
part. Therefore, the asymptotic performance is governed by the host part; if, e.g.,
α = 100 and s = 10^−2, the speedup is only 50, and we are wasting half of the
accelerator's capability. To get a speedup of rα with 0 < r < 1, we need to solve

    1/(s + (1 − s)/α) = rα                                  (B.13)
for s, leading to

    s = (r^−1 − 1)/(α − 1) ,                                (B.14)

which yields s ≈ 1.1 × 10^−3 at r = 0.9 and α = 100. The lesson is that efficient
use of accelerators requires a major part of the original execution time (much more
than just 1 − 1/α) to be moved to special hardware. Incidentally, Amdahl had
formulated his famous law in his original paper along the lines of "accelerated
execution" versus "housekeeping and data management" effort on the ILLIAC IV
supercomputer [R40], which implemented a massively data-parallel SIMD
programming model:
A fairly obvious conclusion which can be drawn at this point is that the effort
expended on achieving high parallel processing rates is wasted unless it is ac-
companied by achievements in sequential processing rates of very nearly the
same magnitude. [M45]
This statement fits the situation described above perfectly. In essence, it was
already derived mathematically in Section 5.3.5, albeit in a slightly different context.
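The numbers above can be checked with a short script. This is a sketch (not from the book); the function names `required_host_fraction` and `speedup` are ad hoc, but the formulas are exactly Eqs. (B.13) and (B.14):

```python
# Numeric check of Eqs. (B.13)/(B.14): how small must the host part s be
# so that an accelerator with speedup factor alpha on the parallel part
# yields an overall speedup of r*alpha?

def required_host_fraction(alpha, r):
    """Solve 1/(s + (1 - s)/alpha) = r*alpha for s, i.e., Eq. (B.14)."""
    return (1.0 / r - 1.0) / (alpha - 1.0)

def speedup(s, alpha):
    """Amdahl speedup with host part s and accelerated part 1 - s."""
    return 1.0 / (s + (1.0 - s) / alpha)

alpha, r = 100.0, 0.9
s = required_host_fraction(alpha, r)
print(f"s = {s:.3e}")                        # s = 1.122e-03
print(f"speedup = {speedup(s, alpha):.1f}")  # speedup = 90.0
```

Plugging in s = 10^−2 instead reproduces the speedup of about 50 quoted above.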
One may argue that enlarging the (accelerated) problem size would mitigate the
problem, but this is debatable because of the memory size restrictions on accelerator
hardware. The higher the performance of a computational unit (core, socket, node,
accelerator), the larger its memory must be to keep the serial (or unaccelerated) part
and communication overhead under control.
Solution 6.1 (page 162): OpenMP correctness.
The variable noise in subroutine f() carries an implicit SAVE attribute, be-
cause it is initialized on declaration. Its initial value will thus only be set on the first
call, which is exactly what would be intended if the code were serial. However, call-
ing f() from a parallel region makes noise a shared variable, and there will be
a race condition. To correct this problem, either noise should be provided as an
argument to f() (similar to the seed in thread safe random number generators), or
its update should be protected via a synchronization construct.
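The implicit SAVE behavior and the suggested fix can be sketched in Python terms (an assumption for illustration, not from the book): a mutable default argument is created once and persists across calls, roughly like a Fortran local variable initialized in its declaration; the names f and noise mirror those in the exercise.

```python
# Analog of Fortran's implicit SAVE: the default list is created once,
# so its state persists across calls. If several threads called f(),
# this hidden shared state would be a race condition.

def f(noise=[0]):       # initialized only on the first call's behalf
    noise[0] += 1
    return noise[0]

print(f(), f(), f())    # 1 2 3 -- the "initial value" is set only once

# The fix suggested above: pass the state explicitly, so each caller
# (e.g., each thread) owns a private copy.
def f_safe(noise):
    noise[0] += 1
    return noise[0]

state = [0]             # in a parallel setting: one 'state' per thread
print(f_safe(state), f_safe(state))   # 1 2
```

The alternative fix, protecting the update with a synchronization construct, would serialize the calls instead of privatizing the state.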
Solution 6.2 (page 162): π by Monte Carlo.
The key ingredient is a thread safe random number generator. According to the
OpenMP standard [P11], the RANDOM_NUMBER() intrinsic subroutine in Fortran 90