Dataframes

Using dataframes, each of Exercises 2 through 9 in this tutorial can be written in three lines of code. You don’t need to create event loops with macros or analysis skeletons; the RDataFrame class and its associated methods handle all of that for you.¹,²

The simplest way to view RDataFrame is as a way to treat an n-tuple like a spreadsheet. You can derive new columns and perform operations on a column like counting entries, summing the values, or finding the maximum/minimum value. You can also perform row-wise operations such as applying a cut (a “filter” in RDataFrame).

Here’s a simple example in Python:

import ROOT
dataframe = ROOT.RDataFrame("tree1","experiment.root")
# Derive pt and apply the cut; both steps return a new dataframe.
selected = dataframe.Define("pt","sqrt(px*px + py*py)").Filter("pz < 145")
# Count and Histo1D return results, not dataframes, so book each one separately.
count = selected.Count()
hist = selected.Histo1D("pt")

This defines a dataframe that contains our standard example n-tuple tree1 from experiment.root. It then applies the following operations to the n-tuple:

  • defines a new column, pt, from a formula;

  • applies a cut of pz < 145 to all the rows;

  • counts the number of rows that pass the cut;

  • makes a histogram of pt for all the rows that pass the cut.

To access the value of the number of rows that pass the cut:

print("The number of events with pz < 145 is", count.GetValue())

To draw the histogram, assuming you’ve defined a suitable canvas:

hist.Draw()
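The spreadsheet-style column operations mentioned earlier (counting, summing, finding the minimum and maximum) might be sketched as follows. The helper name summarize_column is my own invention and not part of ROOT; Count, Sum, Min, Max, and Mean are RDataFrame methods. Treat this as a minimal sketch, assuming the column holds plain numbers:

```python
def summarize_column(df, column):
    """Book several column-wise results on an RDataFrame, then read them back.

    summarize_column is a hypothetical helper; df is expected to be an
    RDataFrame or any dataframe derived from one via Define or Filter.
    """
    # Booking the results; each call returns a result object, not a number.
    booked = {
        "count": df.Count(),
        "sum": df.Sum(column),
        "min": df.Min(column),
        "max": df.Max(column),
        "mean": df.Mean(column),
    }
    # GetValue() turns each booked result into a concrete number.
    return {name: result.GetValue() for name, result in booked.items()}
```

With the standard example file, you could call it as summarize_column(ROOT.RDataFrame("tree1","experiment.root"), "pz") and print the returned dictionary.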

For a more complete example, including the equivalent code in C++, copy the Python notebook RDataFrameExercises.ipynb from ~seligman/root-class and open it via Jupyter. You can also find other examples in the two equivalent areas listed in the References.

Here are other advantages of RDataFrame:

  • It’s easy to set up RDataFrame to use multiple threads, which greatly speeds up execution, though at the expense of losing control of the order in which n-tuple rows are read/written. You can turn this feature on in Python by adding the following line before you construct the RDataFrame:

    ROOT.ROOT.EnableImplicitMT()
    

    In C++:

    ROOT::EnableImplicitMT();
    
  • Although I only show examples using the n-tuple tree1, you can also use other file formats as input to dataframes; e.g., TTrees and CSV files.

  • As noted above, you don’t have to worry about event loops.

  • You can easily save modified dataframes (via the Snapshot method) to preserve the work you’ve done.
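As an illustration of the last point, here is a short sketch of saving a modified dataframe with Snapshot. The function name save_selected and any output tree and file names you pass to it are made up for this example; Define, Filter, and Snapshot are RDataFrame methods, and the column names are the ones from the example n-tuple:

```python
def save_selected(df, tree_name, file_name):
    """Add a pt column, apply the pz cut, and write the result to a new file.

    save_selected is a hypothetical helper; df is expected to be an
    RDataFrame that has the columns px, py, and pz.
    """
    selected = df.Define("pt", "sqrt(px*px + py*py)").Filter("pz < 145")
    # With only a tree name and a file name, Snapshot writes every column,
    # including the newly defined pt.
    return selected.Snapshot(tree_name, file_name)
```

For example, save_selected(dataframe, "tree1_selected", "experiment_selected.root") would write the rows that pass the cut into a new file (both names are placeholders). As far as I know, Snapshot reads the n-tuple immediately by default rather than waiting for a later request, so it’s best to book your other results before calling it.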

With all these benefits, why didn’t I just use RDataFrame in The C++ Path and The Python Path and save you some of the hassle?

  • A teaching reason: To be able to work with dataframes, you need to have some formal understanding of reading rows via an event loop. It’s hard to do that without seeing loop code at least once.

  • Another teaching reason: You need to know how to code loops (and other control structures) in Python and C++ if you’re to use those languages for anything other than ROOT.

  • RDataFrame “stages” its tasks using a technique called “lazy evaluation.” This means RDataFrame won’t read the dataframe from disk until the first actual call that requires using the data to compute a value.

    Consider:

    countPz = dataframe.Filter("pz < 145").Count()
    hist = dataframe.Define("pt","sqrt(px*px + py*py)") \
        .Define("theta","atan2(pt,pz)").Histo1D("pt")
    print("The number of events with pz < 145 is", countPz.GetValue())
    hist.Draw()
    

    When you execute the above code, RDataFrame will “stack” the Filter, Define, and Histo1D actions. It will only read the n-tuple when it executes countPz.GetValue, which requires a concrete numeric value. As it reads the n-tuple it will perform all the stacked actions.

    This means you want to have a strong sense of what RDataFrame actions are staged and which retrieve values. Consider the following code, which just moves a single line compared to the above code:

    countPz = dataframe.Filter("pz < 145").Count()
    print("The number of events with pz < 145 is", countPz.GetValue())
    hist = dataframe.Define("pt","sqrt(px*px + py*py)") \
        .Define("theta","atan2(pt,pz)").Histo1D("pt")
    hist.Draw()
    

    If you execute this code, RDataFrame will read the n-tuple to get the value of countPz. It will then stage the two Define actions and the Histo1D action, and then read the n-tuple a second time to be able to draw the histogram.

    Do things right, and you’ll only read an n-tuple from disk once. Do things wrong, and you could get a slow program that reads an n-tuple from disk over and over again.

  • RDataFrame is reasonably easy to use if all you need are its basic actions. But if you want to do something that requires you to write custom code, its difficulty can ramp up. The RDataFrame class reference has details.
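The one-read versus two-read behavior described in the lazy-evaluation point above can be imitated in a few lines of plain Python. This is a toy model, not ROOT code and not how RDataFrame is implemented: the names LazyTable, LazyResult, book, and get_value are all invented for the illustration, and the three rows stand in for an n-tuple on disk.

```python
class LazyResult:
    """A booked result whose value is computed only when first requested."""
    def __init__(self, table, compute):
        self._table = table
        self._compute = compute
        self._value = None
        self._ready = False

    def get_value(self):
        if not self._ready:
            self._table.run()  # triggers one pass over the rows
        return self._value


class LazyTable:
    """Toy stand-in for a lazily evaluated dataframe."""
    def __init__(self, rows):
        self._rows = rows
        self._pending = []  # results booked but not yet computed
        self.passes = 0     # how many times the rows have been read

    def book(self, compute):
        result = LazyResult(self, compute)
        self._pending.append(result)
        return result

    def run(self):
        self.passes += 1
        rows = list(self._rows)  # one "read from disk"
        for result in self._pending:
            result._value = result._compute(rows)
            result._ready = True
        self._pending = []


def count_pz(rows):
    return sum(1 for row in rows if row["pz"] < 145)

def sum_pz(rows):
    return sum(row["pz"] for row in rows)

rows = [{"pz": 100.0}, {"pz": 150.0}, {"pz": 120.0}]

# Book everything first, then ask for values: one pass.
good = LazyTable(rows)
count, total = good.book(count_pz), good.book(sum_pz)
print(count.get_value(), total.get_value(), good.passes)  # 2 370.0 1

# Ask for a value before booking the second result: two passes.
slow = LazyTable(rows)
count = slow.book(count_pz)
count.get_value()
total = slow.book(sum_pz)
total.get_value()
print(slow.passes)  # 2
```

The only difference between the two halves is the order of the calls, which is the point of the two RDataFrame snippets above. If I read the class reference correctly, RDataFrame has a GetNRuns() method that reports how many event loops have run, which would let you make the same check on a real dataframe.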

With all that said, I’m in favor of using dataframes and plan to use RDataFrame in my projects in the future. But this is definitely a subject in which you can become an expert faster than I can!


1

The term “dataframe” is also an important component of the Python data analysis package pandas and the R programming language. Don’t confuse ROOT’s dataframes with pandas’ or R’s. There is some overlap of concepts, but they’re different things with the same name.

2

The current RDataFrame class was introduced in ROOT 6.14. From ROOT 6.10 to 6.12, the class was called ROOT::Experimental::TDataFrame. Prior to 6.10, you won’t find dataframes in ROOT at all. Since this is an actively evolving feature of ROOT, you’ll want to check which version of ROOT your collaboration uses.

The Nevis notebook server uses the latest stable version of ROOT, but collaborations often stick with a particular older ROOT version.
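If you want to check a ROOT version against the 6.14 threshold mentioned in the footnote, something like the following sketch might help. The function name supports_rdataframe is invented for this example, and it assumes the version string begins with “major.minor” (for example "6.14/04" or "6.30.02"); other formats raise an error instead of being guessed at.

```python
import re

def supports_rdataframe(version):
    """Return True if a ROOT version string is 6.14 or later.

    Hypothetical helper: assumes the string starts with 'major.minor'.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized ROOT version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element, so (6, 30) >= (6, 14) is True.
    return (major, minor) >= (6, 14)
```

In a ROOT session you could pass it the string returned by ROOT.gROOT.GetVersion().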