Listing 6.6: A straightforward implementation of the Gauss–Seidel algorithm in three dimensions. The references to phi(i-1,j,k), phi(i,j-1,k), and phi(i,j,k-1), which read values already updated in the current sweep, cause loop-carried dependencies.
double precision, parameter :: osth=1/6.d0
do it=1,itmax   ! number of iterations (sweeps)
                ! not parallelizable right away
  do k=1,kmax
    do j=1,jmax
      do i=1,imax
        phi(i,j,k) = ( phi(i-1,j,k) + phi(i+1,j,k)   &
                     + phi(i,j-1,k) + phi(i,j+1,k)   &
                     + phi(i,j,k-1) + phi(i,j,k+1) ) * osth
      enddo
    enddo
  enddo
enddo
before, the Gauss–Seidel algorithm has fundamentally different convergence properties as compared to Jacobi (Stein–Rosenberg Theorem).
Parallelization of the Jacobi algorithm is straightforward (see the previous section) because all updates of a sweep go to a different array, but this is not the case
here. Indeed, just writing a PARALLEL DO directive in front of the k loop would
lead to race conditions and yield (wrong) results that most probably vary from run to
run.
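For illustration only, such a naive parallelization could look like the following sketch (it is not part of the original listings): the update of phi(i,j,k) reads phi(i,j,k-1) and phi(i,j,k+1), which belong to k planes that other threads may be updating concurrently.

!$OMP PARALLEL DO                   ! WRONG: loop-carried dependency along k
do k=1,kmax
  do j=1,jmax
    do i=1,imax
      ! phi(i,j,k-1) may or may not already hold the new value, depending
      ! on which thread got there first -> race condition, nondeterministic result
      phi(i,j,k) = ( phi(i-1,j,k) + phi(i+1,j,k)   &
                   + phi(i,j-1,k) + phi(i,j+1,k)   &
                   + phi(i,j,k-1) + phi(i,j,k+1) ) * osth
    enddo
  enddo
enddo
!$OMP END PARALLEL DO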
Still, it is possible to parallelize the code with OpenMP. The key idea is to find a
way of traversing the lattice that fulfills the dependency constraints imposed by the
stencil update. Figures 6.4 and 6.5 show how this can be achieved: Instead of simply
cutting the k dimension into chunks to be processed by OpenMP threads, a wavefront
travels through the lattice in k direction. The dimension along which to parallelize
is j, and each of the t threads T_0 ... T_{t−1} gets assigned a consecutive chunk of size j_max/t along j. This divides the lattice into blocks of size i_max × j_max/t × 1. The very
first block with the lowest k coordinate can only be updated by a single thread (T_0),
which forms a “wavefront” by itself (W_1 in Figure 6.4). All other threads have to
wait in a barrier until this block is finished. After that, the second wavefront (W_2)
can commence, this time with two threads (T_0 and T_1), working on two blocks in
parallel. After another barrier, W_3 starts with three threads, and so forth. W_t is the
first wavefront to actually utilize all threads, ending the so-called wind-up phase.
Some time (t wavefronts) before the sweep is complete, the wind-down phase begins
and the number of working threads is decreased with each successive wavefront. The
block with the largest k and j coordinates is finally updated by a single-thread (T_{t−1})
wavefront W_n again. In the end, n = k_max + t − 1 wavefronts have traversed the lattice
in a “pipeline parallel” pattern. Of those, 2(t − 1) have utilized less than t threads.
The whole scheme can thus only be load balanced if k_max ≫ t.
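Before turning to the actual code, the loop structure of such a pipeline-parallel traversal can be sketched as follows. This is only an illustrative sketch, not the book's implementation (which follows as Listing 6.7); the names tID, nT, jStart, and jEnd are assumptions, and j_max is taken to be divisible by the number of threads. The barrier after each wavefront guarantees that a thread updates its block at plane k only after the neighboring j chunk at plane k and its own chunk at plane k−1 are complete.

use omp_lib                         ! OpenMP runtime routines
! phi, imax, jmax, kmax and osth declared as in Listing 6.6
integer :: i, j, k, l, tID, nT, jStart, jEnd

!$OMP PARALLEL PRIVATE(i,j,k,l,tID,nT,jStart,jEnd)
tID = omp_get_thread_num()
nT  = omp_get_num_threads()
jStart = jmax/nT*tID + 1            ! consecutive j chunk of this thread
jEnd   = jStart + jmax/nT - 1
do l=1,kmax+nT-1                    ! l counts wavefronts
  k = l - tID                       ! thread tID lags tID planes behind thread 0
  if (k.ge.1 .and. k.le.kmax) then  ! idle during wind-up and wind-down
    do j=jStart,jEnd
      do i=1,imax
        phi(i,j,k) = ( phi(i-1,j,k) + phi(i+1,j,k)   &
                     + phi(i,j-1,k) + phi(i,j+1,k)   &
                     + phi(i,j,k-1) + phi(i,j,k+1) ) * osth
      enddo
    enddo
  endif
!$OMP BARRIER
enddo
!$OMP END PARALLEL

In this structure every thread executes all k_max + t − 1 wavefront iterations and hits the barrier in each of them; the iterations in which the if condition fails correspond to the 2(t − 1) partially filled wavefronts mentioned above.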
Listing 6.7 shows a possible implementation of this algorithm. We assume here
for simplicity that j_max is a multiple of the number of threads. Variable l counts the