(lazy-evaluation)=
# Lazy Evaluation

**(15 minutes)**

While working through this tutorial, you might have observed a couple of curious things:

- In {numref}`Figure %s ` in {ref}`transformation`, the diagram elements are not in the order that they were created in {numref}`Listing %s `.
- As you enter the `RDataFrame`-related code for the Walkthroughs and Exercises, the system seems to pause at unpredictable times.

You already know the reason for both of those things is [lazy evaluation](https://medium.com/background-thread/what-is-lazy-evaluation-programming-word-of-the-day-8a6f4410053f), because you read the title of this page.

In general, {dfn}`lazy evaluation` means that the program only performs one or more operations when it needs to actually evaluate something; otherwise, it saves a list of the operations it has to perform if an evaluation is required.[^haskell]

[^haskell]: While it's possible to implement lazy evaluation in many programming languages, there's one programming language in which lazy evaluation is fundamental: [Haskell](https://www.haskell.org/). You may have noticed that it's available as one of the kernels on our {ref}`notebook server `. I have not yet seen Haskell used in particle physics. I suggest you only explore it if you want to learn a new programming paradigm for its own sake.

:::{figure-md} haskell-fig
:align: center

xkcd haskell by Randall Munroe
:::

That may be difficult to grasp. Let's work through an example step-by-step.

Start a new notebook kernel or command-line session. Execute the following lines one-by-one, either each in its own notebook cell or one line at a time (translating from Python to C++ as needed).

    import ROOT

That line took a while, but that's to be expected as you're setting up a large library like ROOT. (In C++, the same thing happens when you start a notebook/session.)

    dataframe = ROOT.RDataFrame("tree1","experiment.root")

That line was fast. Let's keep going.

    ptcuthist = dataframe.Define("pt","sqrt(px*px+py*py)") \
                         .Filter("pt<10") \
                         .Histo1D("pt")

That took no time at all, and it was a complex line. Hmm...

    canvas = ROOT.TCanvas()

ROOT defines canvases quickly. Good!

    ptcuthist.Draw()

Now we have a delay! Drawing the histogram onto the canvas takes a long time, but the complicated definition of **`ptcuthist`** took almost no time.

:::{note}
If you want, you can finish by drawing the canvas:

    canvas.Draw()

That might take a few seconds, but the delay can be attributed to the system putting together the graphics resources to create the image.
:::

That was an example of lazy evaluation. `RDataFrame` built up a list of tasks in memory, but didn't perform any of those tasks yet. Only when you typed `ptcuthist.Draw()` did the program have to actually read the n-tuple (also known as {dfn}`performing the event loop`) in order to draw the histogram.

:::::{admonition} Let's get precise
:class: tip

You might object that I used hazy, relative phrases like "no time at all" and "delay". If you did, congratulations! You're thinking like a scientist. Let's make a measurement.

If you're working in a notebook, or you're using {command}`ipython`, you have access to the `%%time` {ref}`magic command `. Repeat the above walkthrough, but put `%%time` as the first line of every cell. For example:

:::{code-block} python
# First cell
%%time
dataframe = ROOT.RDataFrame("tree1","experiment.root")

# Second cell
%%time
ptcuthist = dataframe.Define("pt","sqrt(px*px+py*py)") \
                     .Filter("pt<10") \
                     .Histo1D("pt")
:::

...and so on.

The exact values you get will depend on many factors; e.g., whether you're doing this walkthrough in C++ or Python; which Jupyter server you're using; how many other users are on the same system. That's why I'm not quoting any hard numbers here.
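
If you're running a plain Python script, where `%%time` isn't available, the standard library's `time.perf_counter` can make a similar measurement. The following is only a minimal sketch: the helper name `timed` is my own invention for illustration, and a Python generator stands in for the lazy computation so that the example runs without ROOT.

```python
import time


def timed(label, action):
    """Call action(), print the elapsed wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s")
    return result, elapsed


# Stand-in for a lazy definition: creating a generator does no arithmetic yet.
lazy_squares, build_time = timed("define", lambda: (n * n for n in range(200_000)))

# The work only happens when something consumes the generator.
total, eval_time = timed("evaluate", lambda: sum(lazy_squares))
```

In the walkthrough, something like `timed("Draw", ptcuthist.Draw)` should play the same role as putting `%%time` at the top of the cell.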

When I try this on my Jupyter notebook, I see that the execution time of `ptcuthist.Draw()` is on the order of a couple of seconds, while the other commands take less than a second at most.
:::::

Why is this important? Consider:

:::{code-block} python
countPz = dataframe.Filter("pz < 145").Count()
hist = dataframe.Define("pt","sqrt(px*px + py*py)") \
                .Define("theta","atan2(pt,pz)").Histo1D("pt")
print("The number of events with pz < 145 is",countPz.GetValue())
hist.Draw()
:::

When you execute the above code, `RDataFrame` will "stack" the `Filter`, `Count`, `Define`, and `Histo1D` operations. It will only perform the event loop when it executes `countPz.GetValue()`, which requires a concrete numeric value. As it reads the n-tuple it will carry out all the stacked operations, so the data for `hist` is accumulated in that same pass.

This means you want to have a sense of when you're asking `RDataFrame` to evaluate a result. Consider the following code, which just moves a single line compared to the above code:

:::{code-block} python
countPz = dataframe.Filter("pz < 145").Count()
print("The number of events with pz < 145 is",countPz.GetValue())
hist = dataframe.Define("pt","sqrt(px*px + py*py)") \
                .Define("theta","atan2(pt,pz)").Histo1D("pt")
hist.Draw()
:::

If you execute this code, `RDataFrame` will read the n-tuple to get the result of `countPz.GetValue()`. It will then stage the two `Define` operations and the `Histo1D` operation, and then perform the event loop again to accumulate the data for `hist.Draw()`.

Do things right, and you'll only perform the event loop once. Do things wrong, and you could get a slow program that reads an n-tuple from disk over and over again.

:::::{admonition} An advanced example

I thought about giving this as an Exercise, but decided against it because it involves Python and C++ programming language features that I haven't discussed. If you think you could have done it on your own, let me know; maybe it will be an Exercise the next time I teach this tutorial.

The goal: Make a histogram of every variable in an n-tuple. Do this without knowing in advance what the columns are. Keep lazy evaluation in mind: you do _not_ want to read the entire n-tuple each time you plot a new column; you only want to read the n-tuple once.

Here are my solutions:

:::{code-block} c++
:name: loop-c-code
:caption: Plotting every variable in an n-tuple (C++)

// Assume the n-tuple has already been defined in
// an RDataFrame named "dataframe". Use the
// GetColumnNames method (see the RDataFrame web page)
// to get a list of the variable names.
auto names = dataframe.GetColumnNames();
auto length = names.size();

// Book one histogram for every column. Store the
// RResultPtr that Histo1D returns; dereferencing it
// here with * would trigger the event loop right away.
std::vector<ROOT::RDF::RResultPtr<TH1D>> histograms;
for ( std::size_t i = 0; i < length; ++i ) {
    histograms.push_back( dataframe.Histo1D( names[i] ) );
}

// Create one canvas for each histogram.
// Draw the histogram on that canvas. The first
// use of -> runs the event loop for all the histograms.
std::vector<TCanvas> canvases(length);
for ( std::size_t i = 0; i < length; ++i ) {
    canvases[i].cd();
    histograms[i]->Draw();
    canvases[i].Draw();
}
:::

:::{code-block} python
:name: loop-python-code
:caption: Plotting every variable in an n-tuple (Python)

# Assume the n-tuple has already been defined in
# an RDataFrame named "dataframe". Use the
# GetColumnNames method (see the RDataFrame web page)
# to get a list of the variable names.
names = dataframe.GetColumnNames()
length = len(names)

# Create one histogram for every column.
histograms = []
for i in range(length):
    histograms.append( dataframe.Histo1D( names[i] ) )

# Create one canvas for each histogram.
# Draw the histogram on that canvas.
canvases = []
for i in range(length):
    canvases.append( ROOT.TCanvas() )
    histograms[i].Draw()
    canvases[i].Draw()
:::

:::{tip}
There are some subtle operational differences between these two pieces of code (use of vectors vs. lists; when the histogram and canvas objects are created; the `->` dereferencing of the result pointers in the C++ code). But let's focus on the lazy-evaluation aspect.
:::

Note that I define all the histograms in their own loop using `Histo1D` _before_ I use `Draw()` on any of them. This means the n-tuple will be read only once, when **`histograms[0].Draw()`** is executed.

What would happen if I didn't think about lazy evaluation and did something like this?

:::{code-block} python
names = dataframe.GetColumnNames()
length = len(names)
for i in range(length):
    hist = dataframe.Histo1D( names[i] )
    canvas = ROOT.TCanvas()
    hist.Draw()
    canvas.Draw()
:::

I'd be performing the event loop multiple times, once for every column in the n-tuple.

:::{tip}
Also, I might only get a histogram of the last column for the reasons discussed in {ref}`two-histogram`.
:::
:::::

:::{figure-md} wall_art-fig
:align: center

xkcd wall_art by Randall Munroe. The relevance of this cartoon will become apparent if you execute one of the "double-loop" code examples above.
:::
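
If you'd like to see the "book first, evaluate later" idea without ROOT getting in the way, here is a toy model in plain Python. It's only a sketch under my own assumptions: the names `ToyFrame`, `ToyResult`, `book`, `get_value`, and `run_event_loop` are hypothetical and are not part of ROOT, and the real `RDataFrame` is far more sophisticated. The toy simply counts how many times the data is read.

```python
class ToyResult:
    """A value that is only filled in when the event loop runs."""

    def __init__(self, frame, update, initial):
        self.frame = frame
        self.update = update    # function (current_value, row) -> new value
        self.value = initial
        self.ready = False

    def get_value(self):
        # Asking for a concrete value forces the event loop, if needed.
        if not self.ready:
            self.frame.run_event_loop()
        return self.value


class ToyFrame:
    """Toy stand-in for a lazily evaluated dataframe (not ROOT)."""

    def __init__(self, rows):
        self.rows = rows        # one dict per "event"
        self.pending = []       # results booked but not yet computed
        self.loops = 0          # how many times the rows have been read

    def book(self, update, initial=0):
        result = ToyResult(self, update, initial)
        self.pending.append(result)
        return result

    def run_event_loop(self):
        # One pass over the data fills every pending result.
        self.loops += 1
        for row in self.rows:
            for result in self.pending:
                result.value = result.update(result.value, row)
        for result in self.pending:
            result.ready = True
        self.pending = []


rows = [{"pz": 140.0 + i} for i in range(10)]   # pz = 140, 141, ..., 149

# Book everything first, then ask for values: the rows are read once.
frame = ToyFrame(rows)
n_low = frame.book(lambda n, row: n + (row["pz"] < 145))
pz_sum = frame.book(lambda s, row: s + row["pz"])
low, total = n_low.get_value(), pz_sum.get_value()
single_pass_loops = frame.loops

# Ask for a value before booking the second result: the rows are read twice.
frame2 = ToyFrame(rows)
n_low2 = frame2.book(lambda n, row: n + (row["pz"] < 145))
n_low2.get_value()
pz_sum2 = frame2.book(lambda s, row: s + row["pz"])
pz_sum2.get_value()
double_pass_loops = frame2.loops

print(single_pass_loops, double_pass_loops)
```

The two halves compute the same numbers; the only difference is where `get_value()` is called, which is the same single-line move shown in the `countPz` example earlier on this page.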