Lazy Evaluation

(15 minutes)

While working through this tutorial, you might have observed a couple of curious things:

In Figure 54 in Transformations and Actions, the diagram elements are not in the order that they were created in Listing 26.
As you enter the RDataFrame-related code for the Walkthroughs and Exercises, the system seems to pause at unpredictable times.

You already know the reason for both of those things is lazy evaluation, because you read the title of this page. In general, lazy evaluation means that a program performs operations only when it actually needs to evaluate something; until then, it saves a list of the operations it will have to perform if an evaluation is required. [1]

That may be difficult to grasp. Let’s work through an example step-by-step. Start a new notebook kernel or command-line session. Execute the following lines one-by-one, either each in its own notebook cell or one line at a time (translating from Python to C++ as needed).

import ROOT

That line took a while, but that’s to be expected as you’re setting up a large library like ROOT. (In C++, the same thing happens when you start a notebook/session.)

dataframe = ROOT.RDataFrame("tree1","experiment.root")

That line was fast. Let’s keep going.

ptcuthist = dataframe.Define("pt","sqrt(px*px+py*py)") \
                     .Filter("pt<10") \
                     .Histo1D("pt")

That took no time at all, and it was a complex line. Hmm…

canvas = ROOT.TCanvas()

ROOT defines canvases quickly. Good!

ptcuthist.Draw()

Now we have a delay! Drawing the histogram onto the canvas takes a long time, but the complicated definition of ptcuthist took almost no time.

Note

If you want, you can finish by drawing the canvas:

canvas.Draw()

That might take a few seconds, but the delay can be attributed to the system putting together the graphics resources to create the image.

That was an example of lazy evaluation. RDataFrame built up a list of tasks in memory, but didn’t perform any of those tasks yet. Only when you typed ptcuthist.Draw() did the program have to actually read the n-tuple (also known as performing the event loop) in order to draw the histogram.
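
If the idea still feels abstract, here is a small sketch in plain Python, with no ROOT involved. It relies on the fact that Python generators are themselves lazily evaluated. The function read_events and its list of values are made up for this illustration; they only stand in for an n-tuple.

```python
# Toy illustration of lazy evaluation with a plain Python generator.
# This is not ROOT code; "read_events" is a made-up stand-in for an n-tuple.
events_read = 0

def read_events():
    """Yield fake pt values one at a time, counting each read."""
    global events_read
    for pt in [2.0, 15.0, 7.5, 11.0, 4.0]:
        events_read += 1
        yield pt

# "Book" a filter. A generator expression only records what to do.
low_pt = (pt for pt in read_events() if pt < 10)
reads_after_booking = events_read
print("Events read after booking:", reads_after_booking)

# Asking for a concrete result forces the loop over the data.
selected = list(low_pt)
print("Events read after evaluating:", events_read)
print("Selected values:", selected)
```

The first count is printed before any value has been read; the reads only happen once list() demands concrete values. That mirrors what you saw with ptcuthist.Draw().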

Let’s get precise

You might object that I used hazy, relative phrases like “no time at all” and “delay”. If you did, congratulations! You’re thinking like a scientist.

Let’s make a measurement. If you’re working in a notebook, or you’re using ipython, you have access to the %%time magic command. Repeat the above walkthrough, but put %%time as the first line of every cell. For example:

%%time
# First cell (%%time must be the very first line of the cell)
dataframe = ROOT.RDataFrame("tree1","experiment.root")

%%time
# Second cell
ptcuthist = dataframe.Define("pt","sqrt(px*px+py*py)") \
                     .Filter("pt<10") \
                     .Histo1D("pt")

…and so on.
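
The %%time magic only exists in IPython and Jupyter. If you’re running a plain Python script instead, one possible substitute is sketched below using the standard time.perf_counter function. The helper name timed is my own invention for this sketch, not part of ROOT or Python, and the sum inside the with block is only a stand-in workload.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the 'with' block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f} s")

# Stand-in workload; in the walkthrough you would put a line such as
# ptcuthist.Draw() inside the 'with' block instead.
with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))
```

Like %%time, this measures elapsed wall-clock time, so the numbers will vary from run to run.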

The exact values you get will depend on many factors; e.g., whether you’re doing this walkthrough in C++ or Python; which Jupyter server you’re using; how many other users are on the same system. That’s why I’m not quoting any hard numbers here.

When I try this on my Jupyter notebook, I see that the execution time of ptcuthist.Draw() is on the order of a couple of seconds, while each of the other commands takes less than a second.

Why is this important? Consider:

countPz = dataframe.Filter("pz < 145").Count()
hist = dataframe.Define("pt","sqrt(px*px + py*py)") \
    .Define("theta","atan2(pt,pz)").Histo1D("pt")
print ("The number of events with pz < 145 is",countPz.GetValue())
hist.Draw()

When you execute the above code, RDataFrame will “stack” the Filter and Define transformations and the Count and Histo1D actions. It will only perform the event loop when it executes countPz.GetValue(), which requires a concrete numeric value. As it reads the n-tuple, it will carry out all the stacked operations in that single pass, so hist.Draw() does not have to read the n-tuple again.

This means you want to have a sense of when you’re asking RDataFrame to evaluate a result. Consider the following code, which just moves a single line compared to the above code:

countPz = dataframe.Filter("pz < 145").Count()
print ("The number of events with pz < 145 is",countPz.GetValue())
hist = dataframe.Define("pt","sqrt(px*px + py*py)") \
    .Define("theta","atan2(pt,pz)").Histo1D("pt")
hist.Draw()

If you execute this code, RDataFrame will read the n-tuple to get the result of countPz.GetValue(). It will then stage two more Define transformations and the Histo1D action, and then perform the event loop again to accumulate the data for hist.Draw().

Do things right, and you’ll only perform the event loop once. Do things wrong, and you could get a slow program that reads an n-tuple from disk over and over again.
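
To make the “once versus twice” difference concrete, here is a toy model in plain Python. It is emphatically not how ROOT implements RDataFrame; the classes ToyFrame and ToyResult, their methods, and the pz values are all invented for this sketch. The only point is to count how many times the data are read.

```python
# Toy model of "book now, loop later". This is NOT how ROOT implements
# RDataFrame; ToyFrame, ToyResult, and their methods are invented here.

class ToyResult:
    """Placeholder for a value that is filled by the next event loop."""
    def __init__(self, frame):
        self._frame = frame
        self._value = None
        self._ready = False

    def GetValue(self):
        if not self._ready:
            self._frame.run_event_loop()   # fills every pending booking
        return self._value


class ToyFrame:
    def __init__(self, events):
        self.events = events
        self.loops = 0          # how many times the data were read
        self._pending = []      # (predicate, result) pairs not yet filled

    def Count(self, predicate):
        """Book a count of events passing predicate; reads nothing yet."""
        result = ToyResult(self)
        self._pending.append((predicate, result))
        return result

    def run_event_loop(self):
        self.loops += 1
        counts = [0] * len(self._pending)
        for event in self.events:                  # one pass over the data
            for i, (predicate, _) in enumerate(self._pending):
                if predicate(event):
                    counts[i] += 1
        for n, (_, result) in zip(counts, self._pending):
            result._value, result._ready = n, True
        self._pending = []


data = [120, 150, 130, 160, 140]   # made-up pz values

# Book both results first, then ask for values: one event loop.
frame = ToyFrame(data)
low = frame.Count(lambda pz: pz < 145)
high = frame.Count(lambda pz: pz >= 145)
print(low.GetValue(), high.GetValue(), "loops:", frame.loops)

# Ask for the first value before booking the second: two event loops.
frame2 = ToyFrame(data)
low2 = frame2.Count(lambda pz: pz < 145)
first = low2.GetValue()
high2 = frame2.Count(lambda pz: pz >= 145)
second = high2.GetValue()
print(first, second, "loops:", frame2.loops)
```

Both versions produce the same counts, but the second one reads the data twice. With five made-up values that costs nothing; with a large n-tuple on disk it can dominate your run time.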

An advanced example

I thought about giving this as an Exercise, but decided against it because it involves Python and C++ programming language features that I haven’t discussed. If you think you could have done it on your own, let me know; maybe it will be an Exercise the next time I teach this tutorial.

The goal: Make a histogram of every variable in an n-tuple. Do this without knowing in advance what the columns are. Keep lazy evaluation in mind: you do not want to read the entire n-tuple each time you plot a new column; you only want to read the n-tuple once.

Here are my solutions:

Listing 27: Plotting every variable in an n-tuple (C++)

// Assume the n-tuple has already been defined in
// an RDataFrame named "dataframe". Use the
// GetColumnNames method (see the RDataFrame web page)
// to get a list of the variable names.
auto names = dataframe.GetColumnNames();
auto length = names.size();

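// Book one histogram for every column. Keep the RResultPtr
// objects that Histo1D returns; dereferencing one of them here
// (with * or ->) would trigger the event loop inside this loop.
std::vector<ROOT::RDF::RResultPtr<TH1D>> histograms;
for ( std::size_t i = 0; i < length; ++i ) {
    histograms.push_back( dataframe.Histo1D( names[i] ) );
}

// Create one canvas for each histogram. 
// Draw the histogram on that canvas. The first "->" below
// triggers the single event loop that fills every histogram.
std::vector<TCanvas> canvases(length);
for ( std::size_t i = 0; i < length; ++i ) {
    canvases[i].cd();
    histograms[i]->Draw();
    canvases[i].Draw();
}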

Listing 28: Plotting every variable in an n-tuple (Python)

# Assume the n-tuple has already been defined in
# an RDataFrame named "dataframe". Use the
# GetColumnNames method (see the RDataFrame web page)
# to get a list of the variable names.
names = dataframe.GetColumnNames()
length = len(names)

# Create one histogram for every column.
histograms = []
for i in range(length):
    histograms.append( dataframe.Histo1D( names[i] ) )

# Create one canvas for each histogram. 
# Draw the histogram on that canvas.
canvases = []
for i in range(length):
    canvases.append( ROOT.TCanvas() )
    histograms[i].Draw()
    canvases[i].Draw()

Tip

There are some subtle operational differences between these two pieces of code (use of vectors vs. lists; when the histogram and canvas objects are created; how the C++ code dereferences the objects returned by Histo1D). But let’s focus on the lazy-evaluation aspect.

Note that I define all the histograms in their own loop using Histo1D before I use Draw() on any of them. This means the n-tuple will be read only once, when Draw() is called on the first histogram.

What would happen if I didn’t think about lazy evaluation and did something like this?

names = dataframe.GetColumnNames()
length = len(names)
for i in range(length):
    hist = dataframe.Histo1D( names[i] )
    canvas = ROOT.TCanvas()
    hist.Draw()
    canvas.Draw()

I’d be performing the event loop multiple times, once for every column in the n-tuple.

Tip

Also, I might only get a histogram of the last column for the reasons discussed in Exercise 5: Two histograms at the same time.

xkcd wall_art — Figure 55: https://xkcd.com/2018/ by Randall Munroe. The relevance of this cartoon will become apparent if you execute one of the “double-loop” code examples above.

[1]

While it’s possible to implement lazy evaluation in many programming languages, there’s one programming language in which lazy evaluation is fundamental: Haskell. You may have noticed that it’s available as one of the kernels on our notebook server.

I have not yet seen Haskell used in particle physics. I suggest you only explore it if you want to learn a new programming paradigm for its own sake.

xkcd haskell — Figure 56: https://xkcd.com/1312/ by Randall Munroe