This would make sure that all threads run on a set of eight different physical cores.
However, it does not bind the threads to those cores; threads can still move inside
the defined mask. Although the OS tends to prevent multiple threads from running
on the same core, they could still change places and destroy NUMA locality. Like-
wise, if we set OMP_NUM_THREADS=4, it would be unspecified which of the eight
cores are utilized. In summary, taskset is more suited for binding single-threaded
processes, but can still be useful if a group of threads runs in a very “symmetric”
environment, like an L2 cache group.
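For a single-threaded process, pinning with taskset is straightforward; for example, a
serial binary can be fixed to one specific core (the core number and the binary name
below are just placeholders):

$ taskset -c 2 ./serial_app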
In a production environment, a taskset-like mechanism should also be inte-
grated into the MPI starting process (i.e., mpirun or one of its variants). A simple
workaround that works with most MPI installations is to let mpirun start taskset
as a wrapper around the actual MPI binary:
$ mpirun -npernode 8 taskset -c 0-7 ./a.out
The -npernode 8 option specifies that only eight processes per node should be
started. Every MPI process (a.out) is run under control of taskset with a 0xFF
affinity mask and thus cannot leave the set of eight distinct physical cores. However,
as in the previous example, the processes are allowed to move freely inside this set.
NUMA locality will remain intact only if the kernel does a good job of maintaining
affinity by default. Moreover, the method in this simple form works only if every
MPI process can be started with the same taskset wrapper.
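If each MPI process should additionally be confined to its own core, one conceivable
refinement (a sketch, not taken from any particular MPI distribution) is a small
wrapper script that derives the core number from the MPI library's local rank
variable. The example below assumes Open MPI, which exports the variable
OMPI_COMM_WORLD_LOCAL_RANK; other MPI implementations use different names, and
pin_wrapper.sh is a made-up file name:

#!/bin/bash
# pin_wrapper.sh: pin each local MPI rank to a distinct core
CORE=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
exec taskset -c $CORE "$@"

It would then be started as mpirun -npernode 8 ./pin_wrapper.sh ./a.out.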
Real binding of multiple (OpenMP) threads and/or (MPI) processes from out-
side the application is more complex. First of all, there is no ready-made tool avail-
able that works in all system environments. Furthermore, compilers usually gener-
ate code that starts at least one “shepherd thread” in addition to the number given
in OMP_NUM_THREADS. Shepherd threads do not execute application code and
should thus not be bound, so any external tool must have a concept of how and
when shepherd threads are started (this is compiler-dependent, of course). Luckily,
OpenMP implementations under Linux are usually based on POSIX threads, and
OpenMP threads are created using the pthread_create() function either when
the binary starts or when the first parallel region is encountered. By overloading
pthread_create() it is possible to intercept thread creation, pin the applica-
tion’s threads, and skip the shepherds in a configurable way [T31]; a minimal sketch
of this interposition approach is shown below, after the likwid-pin example. This works
even with MPI/OpenMP hybrid code, where additional MPI shepherd processes add to
the complexity. The LIKWID tool suite [T20, W120] contains a lightweight program
called likwid-pin, which can bind the threads of a process to specific cores on a
node. In order to define which threads should not be bound (the shepherd threads), a
skip mask has to be specified:
likwid-pin -s <hex skip mask> -c <core list> <command> [args]
Bit b in the skip mask is associated with the (b+1)-th thread that is created via the
pthread_create() function. If a bit is set, the corresponding thread will not be
bound. The core list has the same syntax as with taskset. A typical usage pattern
for an OpenMP binary generated by the Intel compiler would be:
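Assuming the Intel OpenMP runtime creates exactly one shepherd thread as its first
additional thread (the correct skip mask depends on the compiler and runtime version,
so this is only a plausible invocation):

$ OMP_NUM_THREADS=4 likwid-pin -s 0x1 -c 0-3 ./a.out

Here the application's threads are pinned to cores 0-3, while the thread matched by
bit 0 of the skip mask (presumably the shepherd) is left unbound.
To illustrate the interposition mechanism mentioned above, the following is a minimal
sketch, not the actual LIKWID implementation, of a shared library that overloads
pthread_create() via LD_PRELOAD, skips threads according to a hex skip mask, and pins
all others. The environment variable name PIN_SKIP_MASK and the simple "thread i goes
to core i" policy are assumptions made for this example; a real tool would accept a
core list and add thread safety and error checking.

/* pin_wrap.c -- sketch of pthread_create() interposition for thread pinning.
 * Build: gcc -shared -fPIC -o pin_wrap.so pin_wrap.c -ldl -lpthread
 * Use:   LD_PRELOAD=./pin_wrap.so PIN_SKIP_MASK=0x1 ./a.out
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start)(void *), void *arg)
{
    static create_fn real_create = NULL;
    static unsigned long skip = 0;
    static int nthreads = 0;           /* not thread-safe; a sketch only */

    if (real_create == NULL) {
        real_create = (create_fn) dlsym(RTLD_NEXT, "pthread_create");
        const char *mask = getenv("PIN_SKIP_MASK");
        if (mask) skip = strtoul(mask, NULL, 16);
    }
    int id = nthreads++;               /* index of the thread being created */
    int ret = real_create(thread, attr, start, arg);

    /* bit "id" set in the skip mask => leave the (id+1)-th thread unbound */
    if (ret == 0 && !(skip & (1UL << id))) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(id, &set);             /* naive policy: thread id -> core id */
        pthread_setaffinity_np(*thread, sizeof(cpu_set_t), &set);
    }
    return ret;
}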