endif
return
end
When this program is executed, it will create a team of threads in the taskqueue subroutine, with each
thread repeatedly fetching and processing tasks. During the course of processing a task, a thread may
encounter the parallel do construct (if it is processing an interior column). At this point this thread will
create an additional, brand-new team of threads, of which it will be the master, to execute the iterations of
the do loop. The iterations of this do loop will be executed in parallel by this newly created team, just as in
any other parallel region. After the parallel do loop is over, this new team will gather at the implicit barrier,
and the original thread will return to executing its portion of the code. The slave threads of the now
defunct team will become dormant. The nested parallel region therefore simply provides an additional
level of parallelism and semantically behaves just like a nonnested parallel region.
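The skeleton below illustrates this structure. It is a generic sketch rather than the text's Example 4.24; the names nested_demo, inner_work, and do_cell, and the loop bounds, are made up for illustration.

      subroutine do_cell(i, j)
      integer i, j
      ! placeholder for the real per-cell computation
      return
      end

      subroutine inner_work(j, n)
      integer j, n, i
      ! Encountered from within the outer parallel region: the
      ! encountering thread becomes the master of a brand-new inner
      ! team, and the loop iterations are divided among that team.
!$omp parallel do
      do i = 1, n
         call do_cell(i, j)
      enddo
!$omp end parallel do
      return
      end

      program nested_demo
      integer j, n
      parameter (n = 8)
      ! Outer parallel region: the members of the outer team divide
      ! the columns among themselves.
!$omp parallel do private(j)
      do j = 1, n
         call inner_work(j, n)
      enddo
!$omp end parallel do
      end

Each thread of the outer team that calls inner_work creates its own inner team; when the inner parallel do completes, that inner team gathers at the implicit barrier and only the original (master) thread resumes execution in the outer region.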
It is easy to confuse work-sharing constructs with the parallel region construct, so the
distinctions between them bear repeating. A parallel construct (including each of the parallel, parallel do,
and parallel sections directives) is a complete, encapsulated construct that attempts to speed up a portion
of code through parallel execution. Because it is a self-contained construct, there are no restrictions on
where and how often a parallel construct may be encountered.
Work-sharing constructs, on the other hand, are not self-contained but instead depend on the surrounding
context. They work in tandem with an enclosing parallel region (invocation from serial code is like being
invoked from a parallel region but with a single thread). We refer to this as a binding of a work-sharing
construct to an enclosing parallel region. This binding may be either lexical or dynamic (the latter is the case
with orphaned work-sharing constructs). Furthermore, in the presence of nested parallel constructs, this
binding of a work-sharing construct is to the closest enclosing parallel region.
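For instance, an orphaned do directive such as the one sketched below (the routine name scale_array and its arguments are made up) has no lexically enclosing parallel region; it binds dynamically to whichever parallel region encloses the call, and when called from serial code it simply executes on a team of one.

      subroutine scale_array(a, n, factor)
      integer n, i
      real a(n), factor
      ! Orphaned work-sharing construct: it binds to the closest
      ! dynamically enclosing parallel region of the caller.
!$omp do
      do i = 1, n
         a(i) = a(i) * factor
      enddo
!$omp end do
      return
      end

When scale_array is invoked from within a parallel region, every thread of that team must make the call, so that all of them encounter the do directive and the iterations can be divided among the team.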
To summarize, the behavior of a work-sharing construct depends on the surrounding context; therefore
there are restrictions on the usage of work-sharing constructs—for example, all (or none) of the threads
must encounter each work-sharing construct. A parallel construct, on the other hand, is fully self-
contained and can be used without any such restrictions. For instance, as we show in Example 4.24, only
the threads that process an interior column encounter the nested parallel do construct.
Let us now consider a parallel region that happens to execute serially, say, due to an if clause on the
parallel region construct. This has no effect on the semantics of the parallel construct: it executes exactly
as it would in parallel, except that the team consists of only a single thread rather than multiple
threads. We refer to such a region as a serialized parallel region. There is no change in the behavior of
enclosed work-sharing constructs—they continue to bind to the serialized parallel region as before. With
regard to synchronization constructs, the barrier construct also binds to the closest dynamically enclosing
parallel region and has no effect if invoked from within a serialized parallel region. Synchronization
constructs such as critical and atomic (presented in Chapter 5), on the other hand, synchronize relative to
all other threads, not just those in the current team. As a result, these directives continue to function even
when invoked from within a serialized parallel region. Overall, therefore, the only perceptible difference
due to a serialized parallel region is in the performance of the construct.
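The fragment below sketches such a case; the subroutine name, the threshold on n, and the use of the array are illustrative only, and total is assumed to be initialized by the caller. When n is small, the if clause makes this a serialized parallel region, yet the enclosed do, barrier, and critical constructs behave exactly as described above.

      subroutine scale_and_sum(a, n, total)
      integer n, i
      real a(n), total, partial
      ! With a small n, the if clause yields a serialized parallel
      ! region: a team consisting of exactly one thread.
!$omp parallel if (n .gt. 1000) shared(a, total) private(i, partial)
      partial = 0.0
!$omp do
      do i = 1, n
         a(i) = 2.0 * a(i)
         partial = partial + a(i)
      enddo
!$omp end do
      ! barrier binds to the closest dynamically enclosing parallel
      ! region; on a one-thread team it has no effect.
!$omp barrier
      ! critical synchronizes with respect to all threads in the
      ! program, so it functions even in a serialized parallel region.
!$omp critical
      total = total + partial
!$omp end critical
!$omp end parallel
      return
      end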
Unfortunately there is little reported practical experience with nested parallelism. There is only a limited
understanding of the performance and implementation issues with supporting multiple levels of
parallelism, and even less experience with the needs of application programs and their implications for
programming models. For now, nested parallelism continues to be an area of active research. Because
many of these issues are not well understood, by default OpenMP implementations support nested
parallel constructs but serialize the implementation of nested levels of parallelism. As a result, the
program behaves correctly but does not benefit from additional degrees of parallelism.
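A quick way to observe this default is to print the inner team size from inside a nested region; omp_get_num_threads is the standard query routine for the size of the current team, and the program name below is made up. With nested parallelism disabled, the inner region reports a team of one thread even though the outer region runs with multiple threads.

      program nest_check
      integer omp_get_num_threads
      external omp_get_num_threads
!$omp parallel
!$omp parallel
      ! With nested parallelism disabled (the default), this inner
      ! region is serialized and the team size printed here is 1.
      print *, 'inner team size = ', omp_get_num_threads()
!$omp end parallel
!$omp end parallel
      end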
You may change this default behavior by using either the runtime library routine
call omp_set_nested (.TRUE.)
or the environment variable
setenv OMP_NESTED TRUE
to enable nested parallelism; you may use the value false instead of true to disable nested parallelism. In
addition, OpenMP provides a routine to query whether nested parallelism is enabled or disabled: