streams only if we may assume that achievable bandwidth is independent of
this number.
(c) If the length of the cache line is increased, latency stays unchanged but it takes
longer to transfer the data, i.e., the bandwidth contribution to the total transfer time
gets larger. With a 64-byte cache line, we need 1 + 100/20 = 6 outstanding
prefetches, but merely 1 + 100/40 = 3.5, i.e., 4, at 128 bytes.
(d) Transferring two cache lines without latency takes 20 ns, and eight Flops can
be performed during that time. This results in a theoretical performance of
4×10^8 Flops/sec, or 400 MFlops/sec.
Solution 2.1 (page 62): The perils of branching.
Depending on whether data has to be fetched from memory or not, the performance
impact of the conditional can be huge. For out-of-cache data, i.e., large N, the
code performs identically to the standard vector triad, independent of the contents of
C. If N is small, however, performance breaks down dramatically if the branch cannot
be predicted, i.e., for a random distribution of C values. If C(i) is always smaller
than or always greater than zero, performance is restored because the branch can be
predicted perfectly in most cases.
Note that compilers can do interesting things to such a loop, especially if SIMD
operations are involved. If you perform actual benchmarking, try to disable SIMD
functionality on compilation to get a clear picture.
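For reference, here is a minimal sketch of a conditional ("branchy") triad kernel of
the kind discussed above; the array names and the exact condition are assumptions
for illustration, not the exercise's code:

#include <cstddef>

// Hypothetical branchy triad: the update is performed only when the condition
// on C[i] holds, so runtime depends on how well the branch predictor copes
// with the data stored in C.
void branchy_triad(double *A, const double *B, const double *C,
                   const double *D, std::size_t N) {
  for (std::size_t i = 0; i < N; ++i) {
    if (C[i] < 0.0)               // data-dependent branch
      A[i] = B[i] + C[i] * D[i];  // standard triad update
  }
}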
Solution 2.2 (page 62): SIMD despite recursion?
The operations inside a “SIMD unit” must be independent, but they may depend
on data that is a larger distance away, in either the negative or the positive direction.
Although pipelining may be suboptimal for offset < 0, offsets that are multiples of 4
(positive or negative) do not inhibit SIMD vectorization. Note that the compiler will
always refrain from SIMD vectorization in this loop if the offset is not known at
compile time. Can you think of a way to SIMD-vectorize this code even if offset is
not a multiple of 4?
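As an illustration, a minimal sketch that assumes the loop from the exercise has the
form A(i) = s*A(i+offset) (this form is an assumption): with a negative offset that is
a multiple of 4, the four iterations gathered into one 4-wide SIMD register read only
elements written in an earlier group of four, so they are mutually independent.

// Assumed loop form: A[i] = s * A[i + offset].
// With offset = -4 and 4-wide SIMD, iterations i..i+3 read A[i-4..i-1],
// which were all written in the previous group of four, so the four updates
// inside one SIMD register do not depend on each other.
void recursive_update(double *A, double s, long offset, long N) {
  long start = (offset < 0) ? -offset : 0; // keep i + offset non-negative
  // for offset > 0 the caller must provide at least N + offset elements
  for (long i = start; i < N; ++i)
    A[i] = s * A[i + offset];
}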
Solution 2.3 (page 62): Lazy construction on the stack.
A C-style array in a function or block is allocated on the stack. This is an operation
that costs close to no overhead, so it would not make a difference in terms
of performance. However, this option may not always be possible due to stack size
constraints.
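A minimal sketch of the alternative mentioned here (the function names and the fixed
bound are made up for illustration): the C-style array is placed on the stack at
essentially no cost, whereas a std::vector constructed inside the function performs a
heap allocation on every call.

#include <vector>

// Hypothetical example: a temporary work array needed only inside a function.
void with_stack_array(int N) {
  double tmp[1000];             // stack allocation: close to zero overhead, but
                                // the bound is fixed and limited by the stack size
  for (int i = 0; i < N && i < 1000; ++i)
    tmp[i] = 0.0;               // ... work on tmp ...
}

void with_vector(int N) {
  std::vector<double> tmp(N);   // heap allocation (and initialization) on every call
  for (int i = 0; i < N; ++i)
    tmp[i] = 0.0;               // ... work on tmp ...
}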
Solution 2.4 (page 62): Fast assignment.
The STL std::vector<> class has the concept of capacity vs. size. If there is
a known upper limit to the vector length, assignment is possible without re-allocation:
const int max_length=1000;