programs. Now we are going to shift gears a bit and talk in general about how to go about improving the
performance of parallel codes. Specific approaches to tuning depend heavily on the performance tools
available on a given platform, and those tools vary widely from one platform to another. Therefore, we
will not go into too much platform-specific detail, but will instead outline some general principles.
As discussed in earlier chapters, there are two common styles of programming in the OpenMP model:
loop-level parallelization and domain decomposition. Loop-level parallelization allows for an incremental
approach to parallelization. It is easier to start with a serial program and gradually parallelize more of the
code as necessary. When using this technique, the first issue to worry about is coverage. Have you
parallelized enough of your code? And if not, where should you look for improvements? It is important not
to waste time parallelizing a piece of code that does not contribute significantly to the application's
execution time. Parallelize the important parts of the code first.
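To make the incremental style concrete, here is a minimal sketch in C (the arrays and the loop body are
hypothetical): the one loop that profiling shows to be hot gets a directive, and the rest of the program is
left serial until the profile says otherwise.

    #define N 1000000
    double a[N], b[N];

    /* The hot loop identified by the profiler is parallelized first;
     * everything else stays serial for now. */
    void scale(double factor)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = factor * b[i];
    }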
Most (if not all) systems provide profiling tools to determine what fraction of the execution time is spent in
different parts of the source code. These profilers tend to be based on one of two approaches, and many
systems provide both types. The first type of profiler is based on pc-sampling. A clock-based interrupt is
set to go off at fixed time intervals (typical intervals are perhaps 1–10 milliseconds). At each interrupt, the
profiler checks the current program counter, pc, and increments a counter associated with that location. A
postprocessing step uses the counter values to give the user a statistical view of how much time was
spent in each line or
subroutine.
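A highly simplified sketch of the mechanism is shown below, assuming x86-64 Linux (the register name,
the interval, and the address-to-bucket mapping are all platform specific and illustrative):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <sys/time.h>
    #include <ucontext.h>

    #define NBUCKETS 65536
    static unsigned long hist[NBUCKETS];

    /* On each SIGPROF, bucket the interrupted pc into a histogram.
     * A real profiler later maps buckets back to lines/routines. */
    static void on_prof(int sig, siginfo_t *si, void *uc_)
    {
        ucontext_t *uc = (ucontext_t *)uc_;
        /* REG_RIP is x86-64 Linux specific. */
        uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
        hist[(pc >> 4) % NBUCKETS]++;   /* crude address -> bucket map */
    }

    void start_sampling(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_prof;
        sa.sa_flags = SA_SIGINFO | SA_RESTART;
        sigaction(SIGPROF, &sa, NULL);

        /* Fire every 10 ms of CPU time, like the intervals above. */
        struct itimerval it = { {0, 10000}, {0, 10000} };
        setitimer(ITIMER_PROF, &it, NULL);
    }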
The second type of profiler is based on instrumentation. The executable is modified to increment a
counter at every branch (goto, loop, etc.) or label. The instrumentation allows us to obtain an exact count
of how many times every instruction was executed. The tool melds this count with a static estimate of
how much time an instruction or set of instructions takes to execute. Together, this provides an estimate
for how much time is spent in different parts of the code. The advantage of the instrumentation approach
is that it is not statistical. We can get an exact count of how many times an instruction is executed.
Nonetheless, particularly with parallel codes, be wary of instrumentation-based profilers; pc-sampling
usually gives more accurate information.
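As an illustration, here is a hypothetical view of what a routine looks like after an instrumentation tool
has rewritten it (the counter array and the basic-block boundaries are made up for the sketch):

    /* One counter bump per basic block; the tool later merges these
     * exact counts with static per-block time estimates. */
    unsigned long bb_count[3];

    double sum(const double *x, int n)
    {
        double s = 0.0;
        bb_count[0]++;                 /* entry block: executed once */
        for (int i = 0; i < n; i++) {
            bb_count[1]++;             /* loop body: exact trip count */
            s += x[i];
        }
        bb_count[2]++;                 /* exit block */
        return s;
    }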
There are two problems with instrumentation-based approaches that particularly affect parallel codes.
The first problem is that the tool must make an estimate of how much time it takes to execute a set of
instructions. As we have seen, the amount of time it takes to execute an instruction can be highly context
sensitive. In particular, a load or store might take widely varying amounts of time depending on whether
the data is in cache, in some other processor's cache, or in memory. The tool does not have the context
to know, so instrumentation-based profilers typically assume that all memory references hit in the
cache. Relying on such information might lead us to make the wrong trade-offs. For example, using such
a tool you might decide to use a dynamic schedule to minimize load imbalance while completely ignoring
the more important locality considerations.
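A sketch of this trade-off (the array and routine names are illustrative): in a repeated sweep over an
array, a cache-blind profile can point the wrong way.

    #define N 1000000
    double a[N];

    void sweep(int steps)
    {
        for (int t = 0; t < steps; t++) {
            /* schedule(static): thread i touches the same chunk of a[]
             * on every step, reusing data already in its cache.
             * schedule(dynamic) might balance uneven work better, but
             * it scatters chunks across caches on every step. */
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < N; i++)
                a[i] = 0.5 * (a[i] + 1.0);
        }
    }

A profile that charges every load the same cost sees no difference between the two schedules; only a
measurement of the running program does.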
The second problem with instrumentation-based profilers is that they are intrusive. In order to count
events, the tool has changed the code. It is quite possible that most of the execution time goes towards
counting events rather than towards the original computation. On uniprocessor codes this might not be an
important effect. While instrumenting the code might increase the time it takes to run the instrumented
executable, it does not change the counts of what is being instrumented. With multiprocessor codes,
though, this is not necessarily the case. While one processor is busy counting events, another processor
might be waiting for the first processor to reach a barrier. The more time the first processor spends in
instrumentation code, the more time the second processor spends at the barrier. It is possible that an
instrumented code will appear to have a load imbalance problem while the original, real code does not.
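A sketch of how this distortion can arise (the workload and the counter layout are hypothetical): the
counting added by the tool runs inside the parallel region, so any extra counting cost paid by one thread
is absorbed by every other thread as barrier wait.

    #include <omp.h>

    /* Hypothetical per-iteration workload; the return value stands in
     * for "number of events this iteration triggered". */
    static int work(int i) { return i % 7; }

    static unsigned long events[64];   /* one counter slot per thread */

    void region(int n)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            #pragma omp for schedule(static)
            for (int i = 0; i < n; i++)
                events[tid] += work(i);   /* counting added by the tool */
            /* Implicit barrier: a thread that pays more counting cost
             * (more events, or contention on the shared events[] cache
             * lines) delays all the others, so the instrumented run
             * looks load imbalanced even though work() is balanced. */
        }
    }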
Using a pc-sampling-based profiler, we can discover which lines of code contribute to execution time in
the parallel program. Often, profilers in OpenMP systems give an easy way to distinguish the time spent
in a parallel portion of the code from the time spent in serial portions. Usually, each parallel loop or region
is packaged in a separate subroutine. Looking at a profile, it is therefore easy to discover how much time
is spent in serial portions of the code, and in which ones. For loop-based parallelization
approaches, this is the key to figuring out how much of the program is actually parallelized and which
additional parts of the program we should concentrate on parallelizing.
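For example (GCC's generated names are shown as an illustration; other compilers use similar
schemes), a routine containing both a parallel and a serial loop splits into two profile entries:

    void solve(double *x, int n)
    {
        #pragma omp parallel for        /* outlined by the compiler into a  */
        for (int i = 0; i < n; i++)     /* routine with a generated name,   */
            x[i] *= 0.5;                /* e.g. "solve._omp_fn.0" under GCC */

        for (int i = 1; i < n; i++)     /* serial loop: its samples are     */
            x[i] += x[i - 1];           /* charged to solve() itself        */
    }

In the profile, samples charged to the outlined routine represent parallel time, while samples charged to
solve itself represent serial time, so adding up the two kinds of entries gives the coverage estimate
directly.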
With domain decomposition, coverage is less of an issue. Often, the entire program is parallelized.
Sometimes, though, coverage problems can exist, and they tend to be more subtle. With domain
decomposition, portions of code might be replicated; that is, every processor redundantly computes the