(threads)=
# Multiple threads in RDataFrame

An [execution thread](https://en.wikipedia.org/wiki/Thread_(computing)) represents the independent execution of a set of programmed instructions. The basic model is that each core in our computer offers an opportunity to run its own program or process simultaneously with all the other cores.

{ref}`RDataFrame ` is designed to go through an n-tuple row by row and perform some action on each row. To a large degree, the processing of one row has no impact on the processing of any other row. That makes `RDataFrame` an ideal application for multiple threads: each thread processes a single row, and many rows can be processed at once. This can greatly speed up execution.

It's easy to set up `RDataFrame` to use multiple threads. You can turn this feature on by adding a line:

:::{code-block} c++
// In C++:
ROOT::EnableImplicitMT();
:::

:::{code-block} python
# In Python:
ROOT.ROOT.EnableImplicitMT()
:::

:::::{admonition} Turn on threading first
:class: warning

The `EnableImplicitMT` function must be called _before_ you define a dataframe; e.g.,

:::{code-block} python
ROOT.ROOT.EnableImplicitMT()
dataframe = ROOT.RDataFrame("tree1","experiment.root")
:::

If you do it the other way around, you'll get an error message. The reason is that when ROOT creates an `RDataFrame`, it optimizes the set-up based on the number of available threads. If you invoke `EnableImplicitMT` after you define an `RDataFrame`, ROOT will complain that the number of threads available disagrees with the number of threads assumed when the `RDataFrame` was defined.
:::::

If you want to turn off multi-threading, replace `EnableImplicitMT` with `DisableImplicitMT`.

To see how many threads are available to your program:

:::{code-block} c++
// Number of threads in C++
auto poolSize = ROOT::GetThreadPoolSize();
std::cout << "Pool size = " << poolSize << std::endl;
:::

:::{code-block} python
# Number of threads in Python
poolSize = ROOT.GetThreadPoolSize()
print("Pool size =", poolSize)
:::

:::{admonition} Notes
:class: note

- If you get a pool size of 0, it means that multi-threading is turned off. It doesn't mean that nothing executes!

- The whole point of multi-threading is that each row in an n-tuple is processed individually. This means that you lose control of the order in which n-tuple rows are read/written. For a simple n-tuple like `tree1` in `experiment.root`, this doesn't matter. But if an analysis requires that the rows in an n-tuple be in a particular order, you can't use multi-threading. In that case, you may want to consider {ref}`batch-systems`.

- Most `RDataFrame` operations are compatible with multi-threading, but a few are not. These are labeled as "single-thread only" in the [RDataFrame documentation](https://root.cern/doc/master/classROOT_1_1RDataFrame.html). An example of this is `Range(m,n)`, which selects just those entries from rows `m` through `n` (not including `n` itself). If it's not clear why, think about what would happen if you applied a `Filter()` before using `Range()`; remember that with multi-threading you're processing the rows in an unpredictable order.

- If you're {ref}`defining your own functions `, you have to be careful. Writing thread-safe code is hard; I have an extended discussion of this in a footnote in the {ref}`appendix on batch systems `. If you want to write your own functions and make them compatible with multi-threading, look at `DefineSlot` in the [RDataFrame documentation](https://root.cern/doc/master/classROOT_1_1RDataFrame.html).
- `EnableImplicitMT` only enables the _possibility_ of multiple threads. The reality may be more complicated. Originally I meant for this page in the tutorial to be an Exercise for you. The problem is that `experiment.root` is a teeny-tiny n-tuple: only about 2MB in size; only 7 variables per row; only 100,000 events. When I tested it, ROOT determined that multi-threading would not be useful in this case, and refused to allocate more than one execution thread to the `RDataFrame`. The result was that the multi-threaded example took _longer_ than the single-threaded example, because of the overhead of setting up the multi-threaded environment. After a [lengthy discussion on the ROOT forums](https://root-forum.cern.ch/t/simple-way-to-test-prove-enableimplicitmt-performance/54183/15), I tried creating a larger version of `experiment.root` with more events. It took a file 1000 times larger (100,000,000 events in a 2GB file) to see a measurable difference with multiple threads. However, the results were inconsistent; I even crashed the notebook server somehow! I got better results when I divided that big file into 1000 smaller files and read them as a {ref}`TChain ` (there's a sketch of the multi-file idea just after this note). Although that was a more realistic approach to multi-threading, at that point I decided that the whole enterprise was too complex to present to you as an Exercise. The take-away from this long bullet point: It does no harm to put `EnableImplicitMT` at the top of your code; most of the [ROOT RDataFrame examples](https://root.cern.ch/doc/master/group__tutorial__dataframe.html) do this. But it might not give you any benefit for smaller analysis tasks.

:::
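To tie the pieces above together, here is a minimal Python sketch of the whole pattern: enable implicit multi-threading, check the pool size, then run a simple action on `tree1` in `experiment.root`. The commented-out line shows the multi-file variant mentioned in the last note; the glob `experiment_*.root` is a hypothetical set of file names, not something provided with this tutorial.

:::{code-block} python
import ROOT

# Turn on implicit multi-threading *before* creating any RDataFrame.
ROOT.ROOT.EnableImplicitMT()
print("Pool size =", ROOT.GetThreadPoolSize())

# Create the dataframe. The event loop will be divided among the
# available threads, so the rows are processed in no particular order.
dataframe = ROOT.RDataFrame("tree1", "experiment.root")

# If the data were split across many files (hypothetical names), a
# filename glob builds the equivalent of a TChain behind the scenes:
# dataframe = ROOT.RDataFrame("tree1", "experiment_*.root")

# Any action triggers the (possibly multi-threaded) event loop.
print("Number of rows =", dataframe.Count().GetValue())
:::

Whether this actually runs faster than the single-threaded version depends on the size of the input, as described in the last note above.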