RDataFrame concepts

(5 minutes)

Let’s start with some definitions. For purposes of this tutorial, an n-tuple, a spreadsheet, and a dataframe are all the same thing.1\(^,\)2 It’s something that looks like this:

A spreadsheet

Figure 44: You saw this in the class introduction. These are the first few rows and columns in the n-tuple tree1 in file experiment.root.

Some more equivalences: a row in the spreadsheet can also be called an entry in the n-tuple; a column in the spreadsheet is a branch in the n-tuple.

Note

In ROOT, the individual cells can have full-fledged C++ structures in them. To keep things simple I’m sticking with numeric values (leaves in ROOT’s terminology) for this tutorial.

Since we can think of Figure 44 as a spreadsheet, let’s think of the kinds of physics-analysis tasks we might do with the columns and rows in a program like Microsoft Excel, Google Sheets, or Apple Numbers:

  • Sum the values in a column. While this comes up a lot in the business world, it’s not common in a physics analysis.

  • Statistics: Take the mean or standard deviation of a column, or find its minimum or maximum value.

  • Make a histogram of the values in a column. You’ve already done this if you went through the TreeViewer section.

  • Add new columns to the spreadsheet, with the new columns derived from formulas based on existing columns.

The idea behind RDataFrame is to provide a simple way to perform tasks like these.


1

I frequently switch from one term to the other, sometimes within the middle of the same sentence.

2

The term “dataframe” is also an important component of the Python data analysis package pandas, the R programming language, and the HDF5 file format. It pretty much means the same thing in all these environments.

If you’re curious why high-energy physics prefers to use the ROOT file format compared to HDF5, here’s a 2018 paper comparing the use of different file formats and databases in a typical analysis. The TL;DR version: HDF5 is better at storing large multi-dimensional arrays, often found in HPC applications associated with Deep Learning. ROOT is a better choice for storing complex data structures.