# Dataframes

Using dataframes, each of the Exercises 2 through 9 in this tutorial can be written in 3 lines of code. You don't need to create event loops with macros or analysis skeletons; the `RDataFrame` class and its associated methods handle all of that for you.[^f103]$^,$[^f104]

The simplest way to view `RDataFrame` is as a way to treat an n-tuple like a spreadsheet. You can derive new columns and perform operations on a column like counting entries, summing the values, or finding the maximum/minimum value. You can also perform row-wise operations such as applying a cut (a "filter" in `RDataFrame`).

Here's a simple example in Python:

:::{code-block} python
import ROOT
dataframe = ROOT.RDataFrame("tree1","experiment.root")
# Define and Filter return a new dataframe, so further operations can chain from them.
selected = dataframe.Define("pt","sqrt(px*px + py*py)").Filter("pz < 145")
# Count and Histo1D each return a result, not a dataframe, so book them separately.
count = selected.Count()
hist = selected.Histo1D("pt")
:::

This defines a dataframe that contains our standard example n-tuple `tree1` from `experiment.root`. It then applies the following operations to the n-tuple:

- defines a new column, `pt`, from a formula;
- applies a cut of `pz < 145` to all the rows;
- counts the number of rows that pass the cut;
- makes a histogram of `pt` for all the rows that pass the cut.

To access the value of the number of rows that pass the cut:

    print("The number of events with pz < 145 is", count.GetValue())

To draw the histogram, assuming you've defined a suitable canvas:

    hist.Draw()

For a more complete example, including the equivalent code in C++, copy the Python notebook `RDataFrameExercises.ipynb` from `~seligman/root-class` and open it via Jupyter.

You can also see other examples in these two equivalent areas (see {ref}`references`):

- The tutorials area `$ROOTSYS/tutorials/dataframe`;
- [Dataframe tutorials](https://root.cern.ch/doc/master/group__tutorial__dataframe.html).

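
If the spreadsheet picture feels abstract, here is a minimal plain-Python sketch of what the column and row operations mean. It does not use ROOT at all; the three rows are made-up values standing in for an n-tuple, and the comments name the `RDataFrame` operation each step loosely corresponds to.

```python
import math

# Hypothetical rows standing in for an n-tuple (made-up values, not from tree1).
rows = [
    {"px": 3.0, "py": 4.0, "pz": 140.0},
    {"px": 6.0, "py": 8.0, "pz": 150.0},
    {"px": 5.0, "py": 12.0, "pz": 144.0},
]

# Like Define: derive a new column, pt, for every row.
for row in rows:
    row["pt"] = math.sqrt(row["px"] ** 2 + row["py"] ** 2)

# Like Filter: keep only the rows that pass the cut.
selected = [row for row in rows if row["pz"] < 145]

# Column-wise operations on the rows that survive the cut.
n_pass = len(selected)                        # like Count
pt_sum = sum(row["pt"] for row in selected)   # like Sum
pt_max = max(row["pt"] for row in selected)   # like Max
pt_min = min(row["pt"] for row in selected)   # like Min

print("rows passing the cut:", n_pass)
print("sum, max, min of pt:", pt_sum, pt_max, pt_min)
```

Note that this sketch loops over the rows explicitly, which is exactly the bookkeeping that `RDataFrame` takes over for you.
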
Here are other advantages of `RDataFrame`:

- It's easy to set up `RDataFrame` to use multiple threads, which greatly speeds up execution, though at the expense of losing control of the order in which n-tuple rows are read/written. You can turn this feature on in Python by adding this line before you create the dataframe:

      ROOT.ROOT.EnableImplicitMT()

  In C++:

      ROOT::EnableImplicitMT();

- Although I only show examples using the n-tuple `tree1`, you can also use other file formats as input to dataframes; e.g., TTrees and CSV files.

- As noted above, you don't have to worry about event loops.

- You can easily save modified dataframes (via the `Snapshot` method) to preserve the work you've done.

With all these benefits, why didn't I just use `RDataFrame` in {ref}`cpath` and {ref}`pythonpath` and save you some of the hassle?

- A teaching reason: To be able to work with dataframes, you need to have some formal understanding of reading rows via an event loop. It's hard to do that without seeing loop code at least once.

- Another teaching reason: You need to know how to code loops (and other control structures) in Python and C++ if you're to use those languages for anything other than ROOT.

- `RDataFrame` "stages" its tasks using a technique called "lazy evaluation." This means `RDataFrame` won't read the n-tuple from disk until the first actual call that requires using the data to compute a value. Consider:

  :::{code-block} python
  countPz = dataframe.Filter("pz < 145").Count()
  # Parentheses let the chained calls continue onto the next line.
  hist = (dataframe.Define("pt","sqrt(px*px + py*py)")
          .Define("theta","atan2(pt,pz)").Histo1D("pt"))
  print("The number of events with pz < 145 is", countPz.GetValue())
  hist.Draw()
  :::

  When you execute the above code, `RDataFrame` will "stack" the `Filter`, `Count`, `Define`, and `Histo1D` operations. It will only read the n-tuple when it executes `countPz.GetValue`, which requires a concrete numeric value. As it reads the n-tuple it will perform all the stacked operations.

  This means you want to have a strong sense of which `RDataFrame` actions are staged and which retrieve values. Consider the following code, which just moves a single line compared to the above code:

  :::{code-block} python
  countPz = dataframe.Filter("pz < 145").Count()
  print("The number of events with pz < 145 is", countPz.GetValue())
  # Parentheses let the chained calls continue onto the next line.
  hist = (dataframe.Define("pt","sqrt(px*px + py*py)")
          .Define("theta","atan2(pt,pz)").Histo1D("pt"))
  hist.Draw()
  :::

  If you execute this code, `RDataFrame` will read the n-tuple to get the value of `countPz`. It will then stage two more `Define` actions and the `Histo1D` action, and then read the n-tuple again to be able to draw the histogram.

  Do things right, and you'll only read an n-tuple from disk once. Do things wrong, and you could get a slow program that reads an n-tuple from disk over and over again.

- `RDataFrame` is reasonably easy to use if all you need are its basic actions. But if you want to do something that requires you to write custom code, its difficulty can ramp up. The [RDataFrame class reference](https://root.cern/doc/master/classROOT_1_1RDataFrame.html) has details.

With all that said, I'm in favor of using dataframes and plan to use `RDataFrame` in my projects in the future. But this is definitely a subject in which you can become an expert faster than I can!

[^f103]: The term "dataframe" is also an important component of the Python data analysis package [pandas](https://pandas.pydata.org/docs/getting_started/index.html) and the [R programming language](https://www.tutorialspoint.com/r/r_data_frames.htm). Don't confuse ROOT's dataframes with pandas' or R's. There is some overlap of concepts, but they're different things with the same name.

[^f104]: The current `RDataFrame` class was introduced in ROOT 6.14. From ROOT 6.10 to 6.12, the class was called `ROOT::Experimental::TDataFrame`. Prior to 6.10, you won't find dataframes in ROOT at all.
Since this is an actively evolving feature of ROOT, you'll want to check which version of ROOT your collaboration uses. The Nevis notebook server uses the latest stable version of ROOT, but collaborations often stick with a particular older ROOT version.
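
As a postscript to the lazy-evaluation discussion, here is a toy model in plain Python that counts how many times the data is read. It is only a sketch of the idea and not how ROOT implements it: the classes `ToyFrame` and `ToyResult`, their methods, and the sample rows are all hypothetical names invented for this illustration.

```python
class ToyResult:
    """A booked result: it holds a running value that is filled during a pass."""

    def __init__(self, frame, update, start):
        self.frame = frame      # the ToyFrame that will fill this result
        self.update = update    # function (running_value, row) -> new running_value
        self.value = start
        self.ready = False

    def get_value(self):
        # Asking for a concrete value is what triggers reading the data.
        if not self.ready:
            self.frame.run()
        return self.value


class ToyFrame:
    """Toy stand-in for a lazily evaluated dataframe (not ROOT's implementation)."""

    def __init__(self, rows):
        self._rows = rows
        self._pending = []   # results booked but not yet computed
        self.passes = 0      # how many times the rows have been read

    def count(self, predicate):
        """Book a count of the rows passing predicate; nothing is read yet."""
        result = ToyResult(self, lambda acc, row: acc + (1 if predicate(row) else 0), 0)
        self._pending.append(result)
        return result

    def total(self, column):
        """Book the sum of one column; nothing is read yet."""
        result = ToyResult(self, lambda acc, row: acc + row[column], 0.0)
        self._pending.append(result)
        return result

    def run(self):
        """Read the rows once, filling every result booked so far."""
        self.passes += 1
        pending, self._pending = self._pending, []
        for row in self._rows:
            for result in pending:
                result.value = result.update(result.value, row)
        for result in pending:
            result.ready = True


rows = [{"pz": 140.0}, {"pz": 150.0}, {"pz": 144.0}]

# Book both results first, then ask for values: the rows are read once.
frame = ToyFrame(rows)
n_low = frame.count(lambda r: r["pz"] < 145)
pz_sum = frame.total("pz")
n_low.get_value()
pz_sum.get_value()
passes_booked_first = frame.passes

# Ask for the first value before booking the second: the rows are read twice.
frame2 = ToyFrame(rows)
n_low2 = frame2.count(lambda r: r["pz"] < 145)
n_low2.get_value()
pz_sum2 = frame2.total("pz")
pz_sum2.get_value()
passes_interleaved = frame2.passes

print("booked first:", passes_booked_first, "pass(es)")
print("interleaved: ", passes_interleaved, "pass(es)")
```

The two halves compute the same numbers; only the order of booking and retrieving differs, which is the same distinction as moving the `print` line in the `RDataFrame` examples.
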