The desired behavior, clearly, is that library routines continue to work correctly in the presence of parallel
execution. The most natural behavior is for concurrent library calls to execute as if they were invoked one
at a time, although in some nondeterministic order that may vary from one execution to the next. For
instance, the desired behavior for malloc and free is to continue to allocate and deallocate storage,
maintaining the integrity of the heap in the presence of concurrent calls. Similarly, the desired behavior for
I/O operations is to perform the I/O in such a fashion that an individual I/O request is perceived to execute
atomically, although multiple requests may be interleaved in some order. For other routines, such as a
random number generator, it may even be sufficient to simply generate a random number on a per-thread
basis rather than coordinating the random number generation across multiple threads.
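As a concrete sketch of the per-thread approach, the following C code gives each thread its own random
number state using OpenMP's threadprivate directive. The routine name my_rand, the generator
constants, and the seeding scheme are our own illustrative choices, not something prescribed by OpenMP:

    #include <stdio.h>
    #include <omp.h>

    /* Each thread gets a private copy of the seed, so my_rand never
       touches shared state and needs no synchronization at all. */
    static unsigned long seed = 12345;
    #pragma omp threadprivate(seed)

    unsigned long my_rand(void)
    {
        /* a simple linear congruential generator on the private seed */
        seed = seed * 1103515245UL + 12345UL;
        return (seed >> 16) & 0x7fffUL;
    }

    int main(void)
    {
        #pragma omp parallel
        {
            seed += omp_get_thread_num();   /* decorrelate the streams */
            printf("thread %d: %lu\n", omp_get_thread_num(), my_rand());
        }
        return 0;
    }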
Such routines that continue to function correctly even when invoked concurrently by multiple threads are
called thread-safe routines. Although most routines don't start out being thread-safe, they can usually be
made thread-safe through a variety of schemes. Routines that are "pure"—that is, those that do not
contain any state but simply compute and return a value based on the value of the input parameters—are
usually automatically thread-safe since there is no contention for any shared resources across multiple
invocations. Most other routines that do share some state (such as I/O and malloc/free) can usually be
made thread-safe by adding a trivial lock/unlock pair around the entire subroutine (to be safe, we would
preferably add a nest_lock/nest_unlock). The lock/unlock essentially makes the routine single-threaded—
that is, only one invocation can execute at any one time, thereby maintaining the integrity of each
individual call. Finally, for some routines we can often rework the underlying algorithm so that multiple
invocations can execute concurrently while still providing the desired behavior; the random number
generator mentioned earlier is one such candidate.
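To make the lock/unlock scheme concrete, the sketch below serializes a hypothetical library routine using
OpenMP's nestable lock API. The routine update_table and its shared table are invented for illustration;
only the omp_*_nest_lock calls come from the OpenMP runtime library:

    #include <omp.h>

    /* One nestable lock guards all of the library's shared state.
       A nestable lock is the safer choice: if update_table ever calls
       another routine protected by the same lock, the thread already
       holding it can reacquire it without deadlocking. */
    static omp_nest_lock_t lib_lock;
    static int table[100];

    void lib_init(void)   /* call once before any concurrent use */
    {
        omp_init_nest_lock(&lib_lock);
    }

    void update_table(int i, int value)
    {
        omp_set_nest_lock(&lib_lock);    /* serialize the entire routine */
        table[i] = value;
        omp_unset_nest_lock(&lib_lock);
    }

Only one invocation of update_table can execute at any one time, which preserves the integrity of the
shared table at the cost of serializing all callers.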
Although thread safety is highly desirable, the designers of OpenMP felt that while they could prescribe
the behavior of OpenMP constructs, they could not dictate the behavior of every library routine on a
system. Thread safety, therefore, has been left as a vendor-specific issue. It is the hope, and indeed the
expectation, that over time the libraries supplied by most vendors will become thread-safe and function
correctly. Meanwhile, you, the programmer, will have to determine the behavior provided on your favorite
platform.
The second issue is concerned with the underlying implementation, rather than the semantics of OpenMP
itself. This issue relates to the mechanism used by a thread to wait for a synchronization event. What
does a thread do when it needs to wait for an event such as a lock to become available or for other
threads to arrive at a barrier? There are several implementation options in this situation. At one extreme
the thread could simply busy-wait for the synchronization event to occur, perhaps by spinning on a flag in
shared memory. Although this option will inform the thread almost immediately once the synchronization
event is signaled, the waiting thread can adversely affect the performance of other processes on the system. For
instance, it will consume processor cycles that could have been usefully employed by another thread, and
it may cause contention for system resources such as the bus, network, and/or the system memory. At
the other extreme the waiting thread could immediately surrender the underlying processor to some other
runnable process and block, waiting to resume execution only when the synchronization event has been
signaled. This option avoids the waste in system resources, but may incur some delay before the waiting
thread is informed of the synchronization event. Furthermore, it will incur the context switch overhead in
blocking and then unblocking the waiting thread. Finally, there is a range of intermediate options as well,
such as busy-waiting for a while followed by blocking, or busy-waiting combined with an exponential
back-off scheme.
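To illustrate the intermediate option, here is a schematic wait loop in C, written with C11 atomics and the
POSIX sched_yield call; the spin threshold of 1000 iterations is an arbitrary value chosen for illustration,
not a constant from any particular OpenMP runtime:

    #include <stdatomic.h>
    #include <sched.h>

    /* Busy-wait on the flag for a bounded number of iterations; after
       that, yield the processor on every check so that other runnable
       processes can use it while we wait. */
    void wait_for_event(atomic_int *flag)
    {
        int spins = 0;
        while (!atomic_load_explicit(flag, memory_order_acquire)) {
            if (++spins >= 1000)
                sched_yield();   /* surrender the processor */
        }
    }

A real runtime would tune the threshold, and perhaps back off exponentially, but the structure is the
same: cheap spinning while the wait is short, and surrendering the processor when it is not.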
This is clearly an implementation issue, and the OpenMP programmer need never be concerned about it
from a correctness or semantic perspective. It does, however, have some impact on performance, so the
advanced programmer would do well to be aware of how these tradeoffs are resolved on their favorite
platform.
Finally, we briefly mention the impact of the underlying cache on the performance of synchronization
constructs. Since all synchronization mechanisms fundamentally coordinate the execution of multiple
threads, they require the communication of variables between those threads. At the same time, cache-
based machines communicate data between processors at the granularity of a cache line, typically 64 to
128 bytes in size. If multiple variables happen to be located within the same cache line, then an access to
one variable will cause the entire line, and with it all the other variables, to be moved from one processor
to another. This phenomenon is called false sharing and is discussed in greater detail in Chapter 6. For
now it is sufficient to note that since synchronization variables are likely to be heavily communicated, we
would do well to be aware of this accidental sharing and avoid it by padding each heavily communicated
variable out to a cache line.
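As a simple illustration of such padding in C, each flag below is placed in a structure padded out to a full
cache line, so that no two flags ever share a line; the 128-byte line size and the MAX_THREADS bound
are assumptions for the sketch, and the real line size is machine dependent:

    #define CACHE_LINE_SIZE 128   /* assumed; actual size is machine dependent */
    #define MAX_THREADS 16

    /* Unpadded, several flags would share one cache line, and a write
       to any of them would move the whole line between processors.
       Padded, each flag occupies a cache line of its own. */
    struct padded_flag {
        volatile int flag;
        char pad[CACHE_LINE_SIZE - sizeof(int)];
    };

    static struct padded_flag flags[MAX_THREADS];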