Multiple threads in RDataFrame
An execution thread represents the independent execution of a set of programmed instructions. The basic model is that each core in our computer represents an opportunity to run its own program or process simultaneously with all the other cores.
RDataFrame is designed to go through an n-tuple row-by-row and perform some action on each row. To a large degree, the processing of one row has no impact on the processing of any other row. That makes RDataFrame an ideal application for multiple threads: each thread processes one row at a time, so many rows can be processed at once. This can greatly speed up execution.
It’s easy to set up RDataFrame to use multiple threads. You can turn this feature on by adding a line:
// In C++:
ROOT::EnableImplicitMT();
# In Python:
ROOT.ROOT.EnableImplicitMT()
Turn on threading first
The EnableImplicitMT function must be called before you define a dataframe; e.g.,
ROOT.ROOT.EnableImplicitMT()
dataframe = ROOT.RDataFrame("tree1", "experiment.root")
If you do it the other way around, you’ll get an error message. The reason is that when ROOT creates an RDataFrame, it optimizes the set-up based on the number of available threads. If you invoke EnableImplicitMT after you define an RDataFrame, ROOT will complain that the number of threads available disagrees with the number of threads assumed for the RDataFrame definition.
Turning it off
If you want to turn off multi-threading, replace EnableImplicitMT with DisableImplicitMT.
To see how many threads are available to your program:
// Number of threads in C++
auto poolSize = ROOT::GetThreadPoolSize();
std::cout << "Pool size = " << poolSize << std::endl;
# Number of threads in Python
poolSize = ROOT.GetThreadPoolSize()
print("Pool size =", poolSize)
Notes
If you get a pool size of 0, it means that multi-threading is turned off. It doesn’t mean that nothing executes!
The whole point of multi-threading is that each row in an n-tuple is processed individually. This means that you lose control of the order in which n-tuple rows are read/written.
For a simple n-tuple like tree1 in experiment.root, this doesn’t matter. But if an analysis requires that the rows in an n-tuple be in a particular order, you can’t use multi-threading. In that case, you may want to consider Batch Systems.
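To get a feel for what losing control of the order means, here is a minimal sketch in plain Python. It does not use ROOT at all; process_row, the random sleep, and the choice of four workers are assumptions made up for this illustration, standing in for whatever work an analysis does on each row.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_row(row):
    """Hypothetical per-row work; the sleep mimics rows that take
    different amounts of time to process."""
    time.sleep(random.uniform(0.0, 0.005))
    return row * row


rows = list(range(20))

# Single thread: results come back in the same order as the rows.
serial_order = [process_row(row) for row in rows]

# Several threads: collect each result as soon as its thread finishes.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(process_row, row) for row in rows]
    completion_order = [f.result() for f in as_completed(futures)]

print("serial:  ", serial_order)
print("threaded:", completion_order)
# The values are the same; only the order in which they arrive can differ.
print("same values, ignoring order:", sorted(completion_order) == serial_order)
```

If you run this a few times, the threaded list will usually come out in a different order on each run, while the set of values stays the same.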
Most RDataFrame operations are compatible with multi-threading, but a few are not. These are labeled as “single-thread only” in the RDataFrame documentation. An example of this is Range(m,n), which selects just those entries from rows m through n (not including n itself). If it’s not clear why, think about what would happen if you applied a Filter() before using Range(); remember that with multi-threading you’re processing the rows in an unpredictable order.
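The Filter()-then-Range() puzzle can be sketched without ROOT. In the plain-Python illustration below, passes_filter and filter_then_range are hypothetical helpers invented for this example, and a simple reversal stands in for "some other order in which the threads happened to deliver the rows". This only mimics the idea; it is not how RDataFrame is implemented.

```python
def passes_filter(row):
    """Hypothetical cut: keep only even-numbered rows."""
    return row % 2 == 0


def filter_then_range(rows_in_processing_order, begin, end):
    """Apply the cut, then keep survivors number begin through end-1,
    counted in the order in which the rows were processed."""
    survivors = [r for r in rows_in_processing_order if passes_filter(r)]
    return survivors[begin:end]


rows = list(range(20))

# Single thread: rows are processed in file order.
in_order = filter_then_range(rows, 0, 3)

# Stand-in for multi-threading: the same rows, processed in another order.
# (Reversed here so the example is repeatable; real threads give no such promise.)
out_of_order = filter_then_range(rows[::-1], 0, 3)

print("file order:  ", in_order)
print("other order: ", out_of_order)
print("same selection?", in_order == out_of_order)
```

Both selections contain three rows that pass the cut, but they are not the same three rows. "The first three rows that pass" only has a definite meaning when the processing order is fixed.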
If you’re defining your own functions, you have to be careful. Writing thread-safe code is hard; I have an extended discussion of this in a footnote in the appendix on batch systems.
If you want to write your own functions and make them compatible with multi-threading, look at DefineSlot in the RDataFrame documentation. I also have an example program STLntupleRDF (see ROOT Dictionaries for the context).
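The idea behind DefineSlot is that your function is handed a "slot" number as an extra first argument, so it can keep a separate scratch area for each slot instead of sharing one between threads. The sketch below shows that pattern in plain Python, not in ROOT. N_SLOTS, process_chunk, and the way the rows are dealt out to the workers are all assumptions made for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

N_SLOTS = 4  # hypothetical number of processing slots
rows = list(range(1000))

# One private accumulator per slot, so no two workers ever
# update the same list element.
partial_sums = [0] * N_SLOTS


def process_chunk(slot, chunk):
    """Add up the rows in this chunk, writing only to this slot's accumulator."""
    for row in chunk:
        partial_sums[slot] += row


# Deal the rows out to the slots, like dealing cards.
chunks = [rows[slot::N_SLOTS] for slot in range(N_SLOTS)]

with ThreadPoolExecutor(max_workers=N_SLOTS) as pool:
    # list(...) waits for every worker and re-raises any exception.
    list(pool.map(process_chunk, range(N_SLOTS), chunks))

# Merge the per-slot results once all the workers are done.
total = sum(partial_sums)
print("per-slot sums:", partial_sums)
print("matches single-threaded sum:", total == sum(rows))
```

The design choice is to never let two workers touch the same variable while the rows are being processed, and to combine the per-slot pieces only at the end.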
EnableImplicitMT only enables the possibility of multiple threads. The reality may be more complicated.
Originally I meant for this page in the tutorial to be an Exercise for you. The problem is that experiment.root is a teeny-tiny n-tuple: only about 2MB in size; only 7 variables per row; only 100,000 events.
When I tested it, ROOT determined that multi-threading would not be useful in this case, and refused to allocate more than one execution thread to the RDataFrame. The result was that the multi-threaded example took longer than the single-threaded example, because of the overhead in setting up the multi-threaded environment.
After a lengthy discussion on the ROOT forums, I tried creating a larger version of experiment.root with more events. It took a file 1000 times larger (100,000,000 events in a 2GB file) to see a measurable difference with multiple threads. However, the results were inconsistent. I even crashed the notebook server somehow!
I got better results when I divided that big file into 1000 smaller files and read them as a TChain. Although that was a more realistic approach to multi-threading, at that point I decided that the whole enterprise was too complex to present to you as an Exercise.
The take-away from this long bullet point: It does no harm to put EnableImplicitMT at the top of your code; most of the ROOT RDataFrame examples do this. But it might not give you any benefit for smaller analysis tasks.