Why can using dynamic threads improve performance? To understand, let's first consider the simple
example of trying to run a three-thread job on a system with only two processors. The system will need to
multiplex the three threads among the two processors. Every so often, the system will have to stop one thread
from executing and give the processor to another thread. Imagine, for example, that the stopped, or
preempted, thread was in the middle of executing a critical section. Neither of the other two threads will
be able to proceed. They both need to wait for the first thread to finish the critical section. If we are lucky,
the operating system might realize that the threads are waiting on a critical section and might immediately
preempt the waiting threads in favor of the previously preempted thread. More than likely, however, the
operating system will not know, and it will let the other two threads spin for a while waiting for the critical
section to be released. Only after a fixed time interval will some thread get preempted, allowing the holder
of the critical section to complete. The entire time interval is wasted.
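To make the scenario concrete, the following sketch requests three threads and repeatedly enters a critical section; the thread count and the trivial loop body are hypothetical stand-ins for a real workload on a two-processor machine:

    #include <omp.h>

    int main(void)
    {
        int sum = 0;

        /* Request three threads on what we assume is a two-processor
           machine, forcing the operating system to multiplex them. */
        omp_set_num_threads(3);

        #pragma omp parallel
        {
            for (int i = 0; i < 100000; i++) {
                /* If the thread executing this critical section is
                   preempted, the other two threads spin at its entry
                   until the operating system reschedules the holder. */
                #pragma omp critical
                sum = sum + 1;
            }
        }
        return 0;
    }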
One might argue that the above is an implementation weakness. The OpenMP system should
communicate with the operating system and inform it that the other two threads are waiting for a critical
section. The problem is that communicating with the operating system is expensive. The system would
have to perform this expensive communication at every critical section to improve performance in the unlikely
case that a thread is preempted at the wrong time. One might also argue that this is a degenerate case:
critical sections are not that common, and the likelihood of a thread being preempted at exactly the wrong
time is very small. Perhaps that is true, but barriers are significantly more common, and every parallel
loop ends with an implicit barrier. If the duration of a parallel loop is shorter than the interval the
operating system uses to multiplex threads, it is likely that at every parallel loop the program will get
stuck at the barrier, waiting for a thread that is currently preempted.
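For reference, dynamic threads are enabled through the standard OpenMP interface, either by setting the OMP_DYNAMIC environment variable to TRUE or by calling omp_set_dynamic(). A minimal sketch:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Allow the runtime to deliver fewer threads than requested,
           depending on machine load; equivalent to the environment
           setting OMP_DYNAMIC=TRUE. */
        omp_set_dynamic(1);
        omp_set_num_threads(3);   /* now a request, not a guarantee */

        #pragma omp parallel
        {
            #pragma omp single
            printf("team size: %d\n", omp_get_num_threads());
        }
        return 0;
    }

With dynamic threads enabled, the runtime is free to size the team to the processors actually available, avoiding the oversubscription scenarios described above.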
Even ignoring synchronization, running with fewer processors than threads can hinder performance.
Imagine again running with three threads on a two-processor system. Assume that thread 0 starts running
on processor 0 and that thread 1 starts running on processor 1. At some point in time, the operating
system preempts one of the threads, let us say thread 0, and gives processor 0 to thread 2. Processor 1
continues to execute thread 1. After some more time, the operating system decides to restart thread 0.
On which processor should it schedule thread 0? If it schedules it on processor 0, both thread 0 and
thread 2 will get fewer CPU cycles than thread 1. This will lead to a load balancing problem. If it
schedules it on processor 1, all threads will get equal amounts of CPU cycles over time, but we may have
created a locality problem. The data used by thread 0 might still reside in cache 0. By scheduling thread 0
on processor 1, we might have to move all its data from cache 0 to cache 1.
We have given reasons why using more threads than the number of processors can lead to performance
problems. Do the same issues arise when each parallel application individually uses fewer threads than
processors, but the set of applications running together in aggregate uses more threads than processors?
Given, for example, two applications, each requesting all the processors in the system, the operating
system could give each application half the processors. Such a scheduling policy is known as space sharing.
While space sharing is very effective for applications using dynamic threads, without dynamic threads
each application would likely have more threads than allotted processors, and performance would suffer
greatly. Alternatively, the operating system can use gang scheduling, in which an application runs only
when all of its threads can be scheduled at once. Given two applications, each wanting all
the processors in the system, a gang scheduler would at any given time give all the processors to one of
the applications, alternating over time which application gets them. For programs run without dynamic
threads, gang scheduling can greatly improve performance over space sharing; the combination of
dynamic threads and space sharing, however, can outperform gang scheduling without dynamic threads.
There are at least three reasons why gang scheduling can be less effective than space sharing with
dynamic threads. First, gang scheduling can lead to locality problems. Each processor, and more
importantly each cache, is shared by multiple applications. The working sets of the different
applications will evict one another from the caches, leading to more cache misses. Second, gang scheduling can
create packaging problems. Imagine, for example, a 16-processor machine running two applications,
each using nine threads. The two applications cannot run concurrently under gang scheduling, since
together they would need eighteen processors, so the system will alternate between them. Since each
application uses only nine threads, seven of the sixteen processors will be idle at any given time.
Finally, many applications' performance does not scale linearly with
the number of processors. For example, consider running two applications on a two-processor machine.
Assume that each one individually gets a speedup of 1.6 on two processors. With gang scheduling, each
application will only get the processors half the time. In the best case, each application will get a speedup
of 0.8 (i.e., a slowdown of 20%) compared to running serially on an idle machine. With space sharing,
each application will run as a uniprocessor application, and it is possible that each application will run as
fast as it did running serially on an idle machine.
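Spelled out as a worked calculation, with $S(p)$ denoting an application's speedup on $p$ dedicated processors relative to a serial run on an idle machine:

$$ S_{\text{gang}} \le \tfrac{1}{2}\,S(2) = \tfrac{1}{2} \times 1.6 = 0.8, \qquad S_{\text{space}} \approx S(1) = 1.0 $$

The factor of $\tfrac{1}{2}$ reflects that under gang scheduling each application holds the processors only half the time.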