Notes on Numba’s threading implementation
The execution of the work presented by the Numba parallel targets is
undertaken by the Numba threading layer. Practically, the “threading layer”
is a Numba built-in library that can perform the required concurrent execution.
At the time of writing there are three threading layers available, each
implemented via a different lower level native threading library. More
information on the threading layers and appropriate selection of a threading
layer for a given application/system can be found in the
threading layer documentation.
The pertinent information to note for the following sections is that the
function in the threading library that performs the parallel execution is the
parallel_for function. The job of this function is to both orchestrate and
execute the parallel tasks.
The relevant source files referenced in this document are

numba/np/ufunc/tbbpool.cpp
numba/np/ufunc/omppool.cpp
numba/np/ufunc/workqueue.c
These files contain the TBB, OpenMP, and workqueue threadpool implementations, respectively. Each includes the functions set_num_threads(), get_num_threads(), and get_thread_id(), as well as the relevant logic for thread masking in their respective schedulers. Note that the basic thread local variable logic is duplicated in each of these files, and not shared between them.

numba/np/ufunc/parallel.py
This file contains the Python and JIT compatible wrappers for set_num_threads(), get_num_threads(), and get_thread_id(), as well as the code that loads the above libraries into Python and launches the threadpool.

numba/parfors/parfor_lowering.py
This file contains the main logic for generating code for the parallel backend. The thread mask is accessed in this file in the code that generates scheduler code, and passed to the relevant backend scheduler function (see below).
Thread masking
As part of its design, Numba never launches new threads beyond the threads
that are launched initially with numba.np.ufunc.parallel._launch_threads()
when the first parallel execution is run. This is due to the way threads were
already implemented in Numba prior to thread masking being implemented. This
restriction was kept to keep the design simple, although it could be removed
in the future. Consequently, it’s possible to programmatically set the number
of threads, but only to less than or equal to the total number that have
already been launched. This is done by “masking” out unused threads, causing
them to do no work. For example, on a 16 core machine, if the user were to
call set_num_threads(4), Numba would always have 16 threads present, but
12 of them would sit idle for parallel computations. A further call to
set_num_threads(16) would cause those same threads to do work in later
computations.
Thread masking was added to make it possible for a user to programmatically alter the number of threads performing work in the threading layer. Thread masking proved challenging to implement as it required a programming model that is suitable for users, easy to reason about, and safe to implement, with consistent behavior across the various threading layers.
Programming model
The programming model chosen is similar to that found in OpenMP. The reasons
for this choice were that it is familiar to a lot of users, restricted in
scope and also simple. The number of threads in use is specified by calling
set_num_threads and the number of threads in use can be queried by calling
get_num_threads. These two functions behave like their OpenMP counterparts
(with the above restriction that the mask must be less than or equal to the
number of launched threads). The execution semantics are also similar to
OpenMP in that once a parallel region is launched, altering the thread mask
has no impact on the currently executing region, but will have an impact on
parallel regions executed subsequently.
The Implementation
So as to place no further restrictions on user code other than those that
already existed in the threading layer libraries, careful consideration of the
design of thread masking was required. The “thread mask” cannot be stored in a
global value as concurrent use of the threading layer may result in classic
forms of race conditions on the value itself. Numerous designs were discussed
involving various types of mutex on such a global value, all of which were
ultimately broken through thought experiment alone. It transpired that,
following some OpenMP implementations, the “thread mask” is best implemented
as a thread local. This means each thread that executes a Numba parallel
function will have a thread local storage (TLS) slot that contains the value
of the thread mask to use when scheduling threads in the parallel_for
function.
The above notion of TLS use for a thread mask is relatively easy to implement:
get_num_threads and set_num_threads simply need to address the TLS slot
in a given threading layer. This also means that the execution schedule for a
parallel region can be derived from a run time call to get_num_threads. This
is achieved via a well known and relatively easy to implement pattern:
registering a C library function and wrapping it in the internal Numba
implementation.
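As a rough illustration of the TLS idea, the sketch below models the per-thread slot with Python's threading.local. It is not Numba's implementation (that lives in the native threading layers); LAUNCHED_THREADS, _tls and worker are hypothetical names, and the set_num_threads/get_num_threads defined here are stand-ins that only mimic the behavior described above.

```python
import threading

# Hypothetical stand-in for the per-thread TLS slot described above.
_tls = threading.local()
LAUNCHED_THREADS = 16  # assumed size of the already-launched pool


def set_num_threads(n):
    """Store the mask in the calling thread's slot only."""
    if not 1 <= n <= LAUNCHED_THREADS:
        raise ValueError("mask must be between 1 and the launched thread count")
    _tls.mask = n


def get_num_threads():
    """Read the calling thread's mask; an unset slot means the full pool."""
    return getattr(_tls, "mask", LAUNCHED_THREADS)


def worker(mask, results, barrier):
    set_num_threads(mask)
    barrier.wait()  # both threads have written their own mask by now
    results[mask] = get_num_threads()


results = {}
barrier = threading.Barrier(2)
threads = [threading.Thread(target=worker, args=(m, results, barrier))
           for m in (4, 8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread reads back the value it wrote even though both wrote concurrently, which is the property a single global value could not offer without locking.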
In addition to satisfying the original upfront thread masking requirements, a few more complicated scenarios needed consideration as follows.
Nested parallelism
In all threading layers a “main thread” will invoke the parallel_for function
and then in the parallel region, depending on the threading layer, some number
of additional threads will assist in doing the actual work. If the work
contains a call to another parallel function (i.e. nested parallelism) it is
necessary for the thread making the call to know what the “thread mask” of the
main thread is so that it can propagate it into the parallel_for call it makes
when executing the nested parallel function.
The implementation of this behavior is threading layer specific but the general
principle is for the “main thread” to always “send” the value of the thread mask
from its TLS slot to all threads in the threading layer that are active in the
parallel region. These active threads then update their TLS slots with this
value prior to performing any work. The net result of this implementation detail
is that:
thread masks correctly propagate into nested functions
it’s still possible for each thread in a parallel region to safely have a different mask with which to call nested functions; if a mask is not set explicitly, the mask inherited from the “main thread” is used
threading layers which have dynamic scheduling, with threads potentially joining and leaving the active pool during a parallel_for execution, are successfully accommodated
any “main thread” thread mask is entirely decoupled from the in-flux nature of the thread masks of the threads in the active thread pool
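The propagation principle can be sketched in plain Python. This is a toy model under stated assumptions, not the code of any threading layer: the pool is a concurrent.futures.ThreadPoolExecutor, the parallel_for, set_num_threads and get_num_threads below are hypothetical stand-ins, and the mask is only recorded, not used to limit how many workers run.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_tls = threading.local()  # hypothetical per-thread mask slot


def set_num_threads(n):
    _tls.mask = n


def get_num_threads():
    # None means "this thread has no mask yet" in this toy model.
    return getattr(_tls, "mask", None)


def parallel_for(task, n_items, pool):
    # The calling ("main") thread reads its own mask once, up front.
    main_mask = get_num_threads()

    def run(i):
        # Each worker adopts the sent mask before doing any work.
        set_num_threads(main_mask)
        return task(i)

    return list(pool.map(run, range(n_items)))


def nested_task(i):
    # A nested parallel call made from a worker would schedule with this.
    return get_num_threads()


set_num_threads(3)
with ThreadPoolExecutor(max_workers=4) as pool:
    seen = parallel_for(nested_task, 8, pool)
```

Every work item observes the main thread's mask of 3, regardless of which worker picked it up or whether that worker had run anything before.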
Python threads independently invoking parallel functions
The threading layer launch sequence is heavily guarded to ensure that the
launch is both thread and process safe and run once per process. In a system
with numerous Python threading module threads all using Numba, the first
thread through the launch sequence will get its thread mask set appropriately,
but no further threads can run the launch sequence. This means that other
threads will need their initial thread mask set some other way. This is
achieved when get_num_threads is called and no thread mask is present: in
this case the thread mask will be set to the default. In the implementation,
“no thread mask is present” is represented by the value -1 and the “default
thread mask” (unset) is represented by the value 0. The implementation also
immediately calls set_num_threads(NUMBA_NUM_THREADS) after doing this, so
if either -1 or 0 is encountered as a result from get_num_threads() it
indicates a bug in the above processes.
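The lazy initialization can be modelled as follows. This is a simplified sketch rather than the real code: it uses threading.local for the slot, assumes NUMBA_NUM_THREADS is 8, keeps only the -1 sentinel (the intermediate 0 “default” state is skipped), and _raw_mask is a hypothetical helper.

```python
import threading

NUMBA_NUM_THREADS = 8  # assumed configured thread count
NO_MASK = -1           # sentinel: "no thread mask is present"
_tls = threading.local()


def _raw_mask():
    # Hypothetical helper: the slot's value, or the sentinel if unset.
    return getattr(_tls, "mask", NO_MASK)


def set_num_threads(n):
    if not 1 <= n <= NUMBA_NUM_THREADS:
        raise ValueError("mask must be between 1 and NUMBA_NUM_THREADS")
    _tls.mask = n


def get_num_threads():
    if _raw_mask() == NO_MASK:
        # First query on this thread: install the default mask.
        set_num_threads(NUMBA_NUM_THREADS)
    n = _raw_mask()
    if n <= 0:
        # Unreachable unless the logic above is broken.
        raise RuntimeError("invalid thread mask; this would indicate a bug")
    return n


# A fresh thread that never ran the launch sequence or set a mask.
observed = []
t = threading.Thread(target=lambda: observed.append(get_num_threads()))
t.start()
t.join()
```

The fresh thread gets the default on its first query, so callers never see the sentinel.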
OS fork() calls
The use of TLS was also in part driven by Linux (by far the most popular
platform for Numba use) having a fork(2, 3P) call that will do TLS
propagation into child processes; see clone(2)’s CLONE_SETTLS.
Thread ID
A private get_thread_id() function was added to each threading backend,
which returns a unique ID for each thread. This can be accessed from Python by
numba.np.ufunc.parallel._get_thread_id() (it can also be used inside a
JIT compiled function). The thread ID function is useful for testing that the
thread masking behavior is correct, but it should not be used outside of the
tests. For example, one can call set_num_threads(4) and then collect all
unique _get_thread_id() values in a parallel region to verify that only 4
threads are run.
Caveats
Some caveats to be aware of when testing thread masking:
The TBB backend may choose to schedule fewer than the given mask number of threads. Thus a test such as the one described above may return fewer than 4 unique threads.
The workqueue backend is not threadsafe, so attempts to use multithreaded nested parallelism with it may result in deadlocks or other undefined behavior. The workqueue backend will raise a SIGABRT signal if it detects nested parallelism.
Certain backends may reuse the main thread for computation, but this behavior shouldn’t be relied upon (for instance, if propagating exceptions).
Use in Code Generation
The general pattern for using get_num_threads in code generation is

from llvmlite import ir as llvmir
from numba.core import cgutils, types

# `builder` and `context` are supplied by the surrounding lowering code.
# Declare (or look up) get_num_threads: no arguments, intp-sized result.
get_num_threads = cgutils.get_or_insert_function(
    builder.module,
    llvmir.FunctionType(llvmir.IntType(types.intp.bitwidth), []),
    name="get_num_threads")

num_threads = builder.call(get_num_threads, [])

# Guard: a value <= 0 means the thread masking logic went wrong.
with cgutils.if_unlikely(builder, builder.icmp_signed('<=', num_threads,
                                                      num_threads.type(0))):
    cgutils.printf(builder, "num_threads: %d\n", num_threads)
    context.call_conv.return_user_exc(builder, RuntimeError,
                                      ("Invalid number of threads. "
                                       "This likely indicates a bug in Numba.",))

# Pass num_threads through to the appropriate backend function here

See the code in numba/parfors/parfor_lowering.py.
The guard against num_threads being <= 0 is not strictly necessary, but it
can protect against accidentally incorrect behavior in case the thread masking
logic contains a bug.
The num_threads variable should be passed through to the appropriate
backend function, such as do_scheduling or parallel_for. If it’s used
in some way other than passing it through to the backend function, the above
considerations should be taken into account to ensure the use of the
num_threads variable is safe. It would probably be better to keep such
logic in the threading backends, rather than trying to do it in code
generation.
Parallel Chunksize Details
There are some cases in which the actual parallel work chunk sizes may differ
from the chunk size requested through numba.set_parallel_chunksize().
First, if the number of required chunks based on the specified chunk size
is less than the number of configured threads then Numba will use all of the configured
threads to execute the parallel region. In this case, the actual chunk size will be
less than the requested chunk size. Second, because the number of chunks is computed
with truncating division, in cases where the iteration count is slightly less than a
multiple of the chunk size (e.g., 14 iterations and a specified chunk size of 5), the
actual chunk size will be larger than the specified chunk size. In the given example,
the number of chunks would be 2 (14 / 5 truncated) and the actual chunk size would be
7 (i.e. 14 / 2). Lastly, since Numba divides an N-dimensional iteration space into
N-dimensional (hyper)rectangular chunks, it may be the case that there are not N
integer factors whose product is equal to the chunk size. In this case, some chunks
will have an area/volume larger than the specified chunk size whereas others will have
one smaller than it.
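The first two cases can be checked with a small one-dimensional model. This is only a sketch of the arithmetic described above, not Numba's scheduler: model_chunks is a hypothetical helper, it assumes positive inputs, and the way it spreads any remainder over the leading chunks is an assumption made for illustration.

```python
def model_chunks(iterations, chunksize, num_threads):
    """Return the modelled size of each chunk for a 1-D iteration space."""
    # Truncating division: 14 iterations with chunk size 5 gives 2 chunks.
    num_chunks = iterations // chunksize
    # Never use fewer chunks than configured threads.
    num_chunks = max(num_chunks, num_threads)
    # Guard for tiny iteration spaces (an assumption of this model).
    num_chunks = max(1, min(num_chunks, iterations))
    # Split as evenly as possible; leading chunks absorb the remainder.
    base, extra = divmod(iterations, num_chunks)
    return [base + 1 if i < extra else base for i in range(num_chunks)]


# Truncation case from the text: chunks end up larger than requested.
truncated = model_chunks(14, 5, num_threads=2)

# More threads than required chunks: chunks end up smaller than requested.
thread_bound = model_chunks(14, 5, num_threads=4)
```

With 2 threads the model yields two chunks of 7, matching the example in the text; with 4 threads it yields four chunks of sizes 4, 4, 3 and 3, all below the requested size of 5.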