Multiple threads in RDataFrame

An execution thread represents the independent execution of a set of programmed instructions. The basic model is that each core in our computer represents an opportunity to run its own program or process simultaneously with all the other cores.

RDataFrame is designed to go through an n-tuple row-by-row and perform some action on each row. To a large degree, the processing of one row has no impact on the processing of any other row. That makes RDataFrame an ideal application for multiple threads: each thread works through its own share of the rows, so many rows can be processed at once. This can greatly speed up execution.

It’s easy to set up RDataFrame to use multiple threads. You can turn this feature on by adding a line:

// In C++:
ROOT::EnableImplicitMT();

# In Python:
ROOT.ROOT.EnableImplicitMT()

Turn on threading first

The EnableImplicitMT function must be called before you define a dataframe; e.g.,

ROOT.ROOT.EnableImplicitMT()
dataframe = ROOT.RDataFrame("tree1","experiment.root")

If you do it the other way around, you’ll get an error message. The reason is that when ROOT creates an RDataFrame, it optimizes the set-up based on the number of available threads. If you invoke EnableImplicitMT after you define an RDataFrame, ROOT will complain that the number of threads available disagrees with the number of threads assumed for the RDataFrame definition.

Turning it off

If you want to turn off multi-threading, replace EnableImplicitMT with DisableImplicitMT.

To see how many threads are available to your program:

// Number of threads in C++
auto poolSize = ROOT::GetThreadPoolSize();
std::cout << "Pool size = " << poolSize << std::endl;

# Number of threads in Python
poolSize = ROOT.GetThreadPoolSize()
print("Pool size =", poolSize)

Notes

If you get a pool size of 0, it means that multi-threading is turned off. It doesn’t mean that nothing executes!
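To check the note above for yourself, here is a minimal sketch that reports the pool size with multi-threading switched on and then off. The function name report_pool_sizes and the request for four threads are my own choices for illustration. I'm assuming a ROOT version recent enough to provide GetThreadPoolSize, and that you run this in a fresh session before any dataframe has been defined.

```python
def report_pool_sizes(n_threads=4):
    """Return (size_when_enabled, size_when_disabled) for ROOT's thread pool."""
    import ROOT  # imported here so the function can be defined without ROOT

    # Ask ROOT for up to n_threads threads. With no argument, ROOT picks
    # a number based on the cores it finds.
    ROOT.ROOT.EnableImplicitMT(n_threads)
    size_on = ROOT.ROOT.GetThreadPoolSize()

    ROOT.ROOT.DisableImplicitMT()
    size_off = ROOT.ROOT.GetThreadPoolSize()  # expected to be 0: threading is off

    return size_on, size_off
```

Calling report_pool_sizes() should give you a non-zero first number and 0 for the second. The first number may be smaller than what you asked for if your machine has fewer cores.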

The whole point of multi-threading is that each row in an n-tuple is processed independently of the others. This means that you lose control of the order in which n-tuple rows are read/written.

For a simple n-tuple like tree1 in experiment.root, this doesn’t matter. But if an analysis requires that the rows in an n-tuple be in a particular order, you can’t use multi-threading. In that case, you may want to consider Batch Systems.
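Here is a toy illustration of that loss of ordering. It uses Python's own thread pool rather than RDataFrame, and the "n-tuple" is just a made-up list of numbers, so treat it as a sketch of the idea and not of how ROOT schedules its work.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Toy "n-tuple": each row is just a number. (Made-up data for illustration.)
rows = list(range(20))

def process_chunk(chunk):
    """Stand-in for the per-row work: square every row in the chunk."""
    return [row * row for row in chunk]

# Split the rows into chunks of five, one chunk per task.
chunks = [rows[i:i + 5] for i in range(0, len(rows), 5)]

arrival_order = []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(process_chunk, chunk) for chunk in chunks]
    # as_completed yields tasks as they finish, not in the order submitted.
    for future in as_completed(futures):
        arrival_order.extend(future.result())

# Order-independent results are safe: the sum is always the same.
total = sum(arrival_order)
# Order-dependent results are not: arrival_order may differ from run to run.
print("sum of squares =", total)
print("rows arrived in file order?", arrival_order == [r * r for r in rows])
```

The sum comes out the same every time, which is why histograms and counts are fine with multiple threads. The second line may print True on one run and False on the next.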

Most RDataFrame operations are compatible with multi-threading, but a few are not. These are labeled as “single-thread only” in the RDataFrame documentation.

An example of this is Range(m,n), which selects just the rows from m up to n (not including n itself). If it’s not clear why, think about what would happen if you applied a Filter() before using Range(); remember that with multi-threading you’re processing the rows in an unpredictable order.
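If you'd like to see the Filter-then-Range problem spelled out, here is a small pure-Python sketch. Nothing in it is ROOT code: passes_filter and filter_then_range are hypothetical stand-ins for Filter() and Range(), the rows are made up, and the "multi-threaded" order is just one schedule I picked by hand (the last chunk of rows finishing first).

```python
# Toy rows: (row_number, energy). Made-up values for illustration.
rows = [(i, 10 * i) for i in range(12)]

def passes_filter(row):
    """Stand-in for Filter(): keep rows with energy of at least 30."""
    return row[1] >= 30

def filter_then_range(ordered_rows, begin, end):
    """Stand-in for Filter() followed by Range(begin, end)."""
    passed = [row for row in ordered_rows if passes_filter(row)]
    return passed[begin:end]

# Single thread: rows arrive in file order.
in_file_order = filter_then_range(rows, 0, 3)

# One possible multi-threaded schedule: three chunks of four rows,
# with the last chunk finishing first.
chunks = [rows[0:4], rows[4:8], rows[8:12]]
scheduled = [row for chunk in reversed(chunks) for row in chunk]
in_thread_order = filter_then_range(scheduled, 0, 3)

print([n for n, _ in in_file_order])    # row numbers 3, 4, 5
print([n for n, _ in in_thread_order])  # row numbers 8, 9, 10
```

The same filter and the same range pick out different rows depending on the order in which the rows arrive, so "the first three rows that pass" has no stable meaning once the order is unpredictable.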

If you’re defining your own functions, you have to be careful. Writing thread-safe code is hard; I have an extended discussion of this in a footnote in the appendix on batch systems.

If you want to write your own functions and make them compatible with multi-threading, look at DefineSlot in the RDataFrame documentation. I also have an example program STLntupleRDF (see ROOT Dictionaries for the context).
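The idea behind DefineSlot is that your function is told which processing slot it is running in (the slot number is passed as its first argument), so it can use resources that belong to that slot alone. The sketch below shows that pattern with plain Python threads instead of ROOT. The names N_SLOTS, partial_sums, and worker are my own, and the data are made up.

```python
import threading

N_SLOTS = 4
rows = list(range(1000))  # made-up data

# One accumulator per slot, so no two threads ever write to the same element.
partial_sums = [0] * N_SLOTS

def worker(slot, my_rows):
    """Process rows using only the state that belongs to this slot."""
    for row in my_rows:
        partial_sums[slot] += row

threads = [
    threading.Thread(target=worker, args=(slot, rows[slot::N_SLOTS]))
    for slot in range(N_SLOTS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Merge the per-slot results once all threads have finished.
total = sum(partial_sums)
print("total =", total)
```

Because each thread only touches its own element of partial_sums, no locking is needed while the rows are being processed. The only shared step is the merge at the end, which happens after every thread is done.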

EnableImplicitMT only enables the possibility of multiple threads. The reality may be more complicated.

Originally I meant for this page in the tutorial to be an Exercise for you. The problem is that experiment.root is a teeny-tiny n-tuple: only about 2MB in size; only 7 variables per row; only 100,000 events.

When I tested it, ROOT determined that multi-threading would not be useful in this case, and refused to allocate more than one execution thread to the RDataFrame. The result was that the multi-threaded example took longer than the single-threaded example, because of the overhead in setting up the multi-threaded environment.

After a lengthy discussion on the ROOT forums, I tried creating a larger version of experiment.root with more events. It took a file 1000 times larger (100,000,000 events in a 2GB file) to see a measurable difference with multiple threads. However, the results were inconsistent. I even crashed the notebook server somehow!

I got better results when I divided that big file into 1000 smaller files and read them as a TChain. Although that was a more realistic approach to multi-threading, at that point I decided that the whole enterprise was too complex to present to you as an Exercise.
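If you want to try this kind of test yourself, here is a rough sketch of how one might time a simple count over a chain of files. It is not the program I used. The function name time_event_count and the choice of Count() as the workload are assumptions for illustration, and you would have to supply your own files.

```python
import time

def time_event_count(file_pattern, tree_name="tree1", use_threads=True):
    """Count the rows in a chain of files; return (n_rows, seconds)."""
    import ROOT  # imported here so the function can be defined without ROOT

    # Threading must be switched on (or off) before the dataframe is built.
    if use_threads:
        ROOT.ROOT.EnableImplicitMT()
    else:
        ROOT.ROOT.DisableImplicitMT()

    # A TChain treats many files as one long n-tuple.
    chain = ROOT.TChain(tree_name)
    chain.Add(file_pattern)  # wildcards such as "*.root" are allowed
    dataframe = ROOT.RDataFrame(chain)

    start = time.perf_counter()
    n_rows = dataframe.Count().GetValue()  # GetValue() triggers the event loop
    elapsed = time.perf_counter() - start
    return n_rows, elapsed
```

You would call it with something like time_event_count("experiment_part_*.root"), where the file names are hypothetical. For a fair comparison, I'd suggest running the threaded and unthreaded versions in separate sessions, and more than once each, since the first read of a file is usually slower than later ones.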

The take-away from this long bullet point: It does no harm to put EnableImplicitMT at the top of your code; most of the ROOT RDataFrame examples do this. But it might not give you any benefit for smaller analysis tasks.