# Dataframes

Using dataframes, each of Exercises 2 through 9 in this tutorial can be written
in 3 lines of code. You don't need to create event loops with macros or
analysis skeletons; the `RDataFrame` class and its associated methods
handle all of that for you.[^f103]$^,$[^f104]

The simplest way to view `RDataFrame` is as a way to treat an n-tuple like
a spreadsheet. You can derive new columns and perform operations on a
column like counting entries, summing the values, or finding the
maximum/minimum value. You can also perform row-wise operations such as
applying a cut (a "filter" in `RDataFrame`).

Here's a simple example in Python:

:::{code-block} python
import ROOT
dataframe = ROOT.RDataFrame("tree1","experiment.root")
# Book the new column and the cut once, then attach two results to them.
selected = dataframe.Define("pt","sqrt(px*px + py*py)").Filter("pz < 145")
count = selected.Count()
hist = selected.Histo1D("pt")
:::

This defines a dataframe that contains our standard example n-tuple
`tree1` from `experiment.root`. It then applies the following
operations to the n-tuple:

-   defines a new column, `pt`, from a formula;

-   applies a cut of `pz < 145` to all the rows;

-   counts the number of rows that pass the cut;

-   makes a histogram of `pt` for all the rows that pass the cut.

To access the value of the number of rows that pass the cut:

    print ("The number of events with pz < 145 is",count.GetValue())

To draw the histogram, assuming you've defined a suitable canvas:

    hist.Draw()

For a more complete example, including the equivalent code in C++, copy
the Python notebook `RDataFrameExercises.ipynb` from
`~seligman/root-class` and open it via Jupyter. You can also see
other examples in these two equivalent areas (see {ref}`references`):

-   The tutorials area `$ROOTSYS/tutorials/dataframe`;

-   [Dataframe tutorials](https://root.cern.ch/doc/master/group__tutorial__dataframe.html).

Here are other advantages of RDataFrame:

-   It's easy to set up RDataFrame to use multiple threads, which
    greatly speeds up execution, though at the expense of losing control
    of the order in which n-tuple rows are read/written. You can turn
    this feature on in Python by adding this line before you create the
    dataframe:

        ROOT.ROOT.EnableImplicitMT()

    In C++:

        ROOT::EnableImplicitMT();

-   Although I only show examples using the n-tuple `tree1`, you can
    also use other file formats as input to dataframes; e.g., TTrees and
    CSV files.

-   As noted above, you don't have to worry about event loops.

-   You can easily save modified dataframes (via the `Snapshot`
    method) to preserve the work you've done.

With all these benefits, why didn't I just use `RDataFrame` in {ref}`cpath`
and {ref}`pythonpath` and save you some of the hassle?

-   A teaching reason: To be able to work with dataframes, you need to
    have some formal understanding of reading rows via an event loop.
    It's hard to do that without seeing loop code at least once.

-   Another teaching reason: You need to know how to code loops (and
    other control structures) in Python and C++ if you're to use those
    languages for anything other than ROOT.

-   `RDataFrame` "stages" its tasks using a technique called "lazy
    evaluation." This means `RDataFrame` won't read the n-tuple from
    disk until the first actual call that requires using the data to
    compute a value.

    Consider:

    :::{code-block} python
    countPz = dataframe.Filter("pz < 145").Count()
    hist = (dataframe.Define("pt","sqrt(px*px + py*py)")
        .Define("theta","atan2(pt,pz)").Histo1D("pt"))
    print ("The number of events with pz < 145 is",countPz.GetValue())
    hist.Draw()
    :::

    When you execute the above code, RDataFrame will "stack" the
    `Filter`, `Count`, `Define`, and `Histo1D` actions. It will only read the
    n-tuple when it executes `countPz.GetValue()`, which requires a
    concrete numeric value. As it reads the n-tuple it will perform all the
    stacked actions.

    This means you want to have a strong sense of which RDataFrame actions
    are staged and which retrieve values. Consider the following code,
    which just moves a single line compared to the above code:

    :::{code-block} python
    countPz = dataframe.Filter("pz < 145").Count()
    print ("The number of events with pz < 145 is",countPz.GetValue())
    hist = (dataframe.Define("pt","sqrt(px*px + py*py)")
        .Define("theta","atan2(pt,pz)").Histo1D("pt"))
    hist.Draw()
    :::

    If you execute this code, `RDataFrame` will read the n-tuple to get the
    value of `countPz`. It will then stage two more `Define` actions
    and the `Histo1D` action, and then read the n-tuple again to be able
    to draw the histogram.

    Do things right, and you'll only read an n-tuple from disk once. Do
    things wrong, and you could get a slow program that reads an n-tuple
    from disk over and over again.

-   `RDataFrame` is reasonably easy to use if all you need are its basic
    actions. But if you want to do something that requires you to
    write custom code, its difficulty can ramp up. The [RDataFrame
    class
    reference](https://root.cern/doc/master/classROOT_1_1RDataFrame.html)
    has details.

With all that said, I'm in favor of using dataframes and plan to use
`RDataFrame` in my projects in the future. But this is definitely a
subject in which you can become an expert faster than I can!

[^f103]: The dataframe is also an important component of the
    Python data analysis package
    [pandas](https://pandas.pydata.org/docs/getting_started/index.html)
    and the [R programming
    language](https://www.tutorialspoint.com/r/r_data_frames.htm). Don't
    confuse ROOT's dataframes with pandas' or R's. There is some overlap
    of concepts, but they're different things with the same name.

[^f104]: The current RDataFrame class was introduced in ROOT 6.14. From
    ROOT 6.10 to 6.12, the class was called
    `ROOT::Experimental::TDataFrame`. Prior to 6.10, you won't find
    dataframes in ROOT at all. Since this is an actively evolving
    feature of ROOT, you'll want to check which version of ROOT your
    collaboration uses. 

    The Nevis notebook server uses the latest stable
    version of ROOT, but collaborations often stick with a particular
    older ROOT version.