Walkthrough: Making scatterplots

(15 minutes)

Now that we’ve had some practice making one-dimensional histograms, let’s make a two-dimensional histogram. Let’s see if we can take the same approach that we did for Exercise 1. To make a 1-D histogram we used Histo1D; when I look at the RDataFrame web page I see there’s a Histo2D method. So it’s obvious that it should be something like:

Listing 17: RDataFrame - does this make a 2-D histogram (C++)

auto hist2dim = dataframe.Histo2D("ebeam","px");
hist2dim->Draw();
canvas.Draw();

Listing 18: RDataFrame - does this make a 2-D histogram (Python)

hist2dim = dataframe.Histo2D("ebeam","px")
hist2dim.Draw()
canvas.Draw()

Give it a try!

Hey, what’s happening? Am I being sneaky again?

Not this time. This is one of those (fortunately rare) cases where ROOT is not uniform in its approach. To make a 2-D histogram with RDataFrame, you have to supply the same parameters to Histo2D as if you were creating the histogram “by hand.”

If you look up TH2D, in analogy with TH1D, you’ll see that the arguments to TH2D are something like:

hist2d = TH2D("name","title",nxbins,xlo,xhi,nybins,ylo,yhi)

where:

"name" is the ROOT name of the histogram;
"title" is the title of the histogram, which is shown at the top of the plot;
nxbins is the number of bins on the x-axis;
xlo is the lower limit of the x-axis of the plot;
xhi is the upper limit of the x-axis of the plot;
nybins is the number of bins on the y-axis;
ylo is the lower limit of the y-axis of the plot;
yhi is the upper limit of the y-axis of the plot.

When using RDataFrame, you have to explicitly supply these values to Histo2D like this:

Histo2D(("name", "title", nxbins, xlo, xhi, nybins, ylo, yhi),"ebeam","px")

Here’s how it looks in the actual code, specifying the TH2D parameters in an initializer list in the respective languages:

Listing 19: RDataFrame - making a 2-D histogram (C++)

auto hist2dim = dataframe.Histo2D({"hist2dim", "ebeam vs px", 100, 149, 151, 100, -20, 20},"ebeam","px");
hist2dim->Draw();
canvas.Draw();

Listing 20: RDataFrame - making a 2-D histogram (Python)

hist2dim = dataframe.Histo2D(("hist2dim", "ebeam vs px", 100, 149, 151, 100, -20, 20),"ebeam","px")
hist2dim.Draw()
canvas.Draw()

Give it a try!
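If you're wondering what those six axis numbers imply, here is a minimal plain-Python sketch (no ROOT required) that works out the bin edges and bin widths for the same values used in the listings. The helper name bin_edges is my own invention for illustration; it is not part of ROOT, and it assumes equally spaced bins.

```python
# Plain-Python sketch; no ROOT needed.
# bin_edges is a hypothetical helper, not a ROOT function.

def bin_edges(nbins, lo, hi):
    """Return the nbins+1 equally spaced edges of an axis running from lo to hi."""
    width = (hi - lo) / nbins
    return [lo + i * width for i in range(nbins + 1)]

# Same axis parameters as in the Histo2D listings.
x_edges = bin_edges(100, 149, 151)   # ebeam axis
y_edges = bin_edges(100, -20, 20)    # px axis

x_width = (151 - 149) / 100          # width of one bin along x
y_width = (20 - (-20)) / 100         # width of one bin along y
n_cells = (len(x_edges) - 1) * (len(y_edges) - 1)

print("x bin width:", x_width)
print("y bin width:", y_width)
print("cells in the grid:", n_cells)
```

The point is that a 2-D histogram is a grid of nxbins times nybins cells, so the number of cells grows quickly as you add bins to either axis.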

Note

This is a scatterplot, a handy way of observing the correlations between two variables. The Histo2D command interprets its last two arguments as the variables to plot on the x-axis and the y-axis, in that order.

It’s easy to fall into the trap of thinking that each (x,y) point on a scatterplot represents two values in your n-tuple. The scatterplot is a grid; each square in the grid is randomly populated with a density of dots proportional to the number of values in that square.

This leads to the question: How did I know the values for xlo, xhi, ylo, and yhi in the above examples? The answer is that I made 1-D plots for the variables so I knew their range, then used those values for the 2-D axis limits.[1][2]
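To make the idea of "a grid with counts in each square" concrete, here is a rough plain-Python sketch (no ROOT required). It picks axis limits from the range of the data, then counts how many (x,y) pairs land in each square. The data values and the helper names padded_range and fill_grid are made up for illustration; this is not how ROOT implements its histograms, just the same idea in miniature.

```python
# Plain-Python sketch; no ROOT needed.
# The helper names and the data values are hypothetical.

def padded_range(values, pad_fraction=0.05):
    """Return (lo, hi) covering all values, widened by a small margin on each side."""
    lo, hi = min(values), max(values)
    pad = (hi - lo) * pad_fraction
    return lo - pad, hi + pad

def fill_grid(xs, ys, nx, xlo, xhi, ny, ylo, yhi):
    """Count (x, y) pairs per grid cell; pairs outside the limits are skipped."""
    grid = [[0] * nx for _ in range(ny)]   # indexed as grid[iy][ix]
    for x, y in zip(xs, ys):
        if not (xlo <= x < xhi and ylo <= y < yhi):
            continue
        # Convert each value to a bin index; min() guards against rounding at the top edge.
        ix = min(int((x - xlo) / (xhi - xlo) * nx), nx - 1)
        iy = min(int((y - ylo) / (yhi - ylo) * ny), ny - 1)
        grid[iy][ix] += 1
    return grid

# Made-up values standing in for two n-tuple variables.
xs = [149.2, 149.9, 150.0, 150.1, 150.1, 150.8]
ys = [-12.0, -3.0, 0.5, 1.0, 1.2, 9.0]

# Step 1: look at the range of each variable (the job the 1-D plots did).
xlo, xhi = padded_range(xs)
ylo, yhi = padded_range(ys)

# Step 2: count entries per square of a 5x5 grid.
grid = fill_grid(xs, ys, 5, xlo, xhi, 5, ylo, yhi)
for row in reversed(grid):   # print the highest-y row first, like a plot
    print(row)
```

Notice that two nearby pairs can share a square. Once that happens, the grid only remembers the count, not the individual values.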

Now that you have the recipe, try making scatterplots of different pairs of variables. Do you see any correlations?

Note

If you see a shapeless blob on the scatterplot, the variables are likely to be uncorrelated; for example, plot px versus py. If you see a pattern, there may be a correlation; for example, plot pz versus zv. It appears that the higher pz is, the lower zv is, and vice versa. Perhaps the particle loses energy before it is deflected in the target.
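If you'd like a number to go with "blob" versus "pattern," one common choice is the Pearson correlation coefficient: values near 0 suggest no linear relationship, while values near +1 or -1 suggest a strong one. Here is a small plain-Python sketch (no ROOT required). The function name pearson and the data are my own illustration, not values from the tutorial's n-tuple. Keep in mind that this coefficient only measures straight-line trends, so a scatterplot can show a clear pattern even when the coefficient is close to zero.

```python
# Plain-Python sketch; no ROOT needed.
# pearson is a hypothetical helper and the data values are made up.
import math

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

a = [1.0, 2.0, 3.0, 4.0, 5.0]
falling = [10.0, 8.0, 6.0, 4.0, 2.0]    # goes down steadily as a goes up
unrelated = [3.0, 1.0, 4.0, 1.0, 3.0]   # no linear trend with a

r_falling = pearson(a, falling)
r_unrelated = pearson(a, unrelated)
print("a vs falling:  ", round(r_falling, 3))
print("a vs unrelated:", round(r_unrelated, 3))
```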

Figure 51: xkcd correlation, https://xkcd.com/552/ by Randall Munroe

1

There’s another obvious question: Why is this necessary? The Histo1D method is able to automatically determine the scale of its single x-axis; why can’t Histo2D do the same for its axes?

I hunted for the reason, and finally asked the question on the ROOT Forums. The answer has to do with being able to use RDataFrame with multiple threads, a subject I address in the intermediate topics section. While running with multiple execution threads, the ROOT developers can make automatic scaling of Histo1D work, but they haven’t figured out how to make automatic axis scaling work with Histo2D (or Histo3D, for that matter).

The lesson here: Even though RDataFrame is generally easier to use (yes, really!) than the techniques described in The C++ Path or The Python Path, there are still times when you have to deal with ROOT’s peculiarities.

2

You can also explicitly specify the parameters when creating a 1-D histogram; e.g.,

 hist1 = dataframe.Histo1D(("h1", "ebeam", 100, 149, 151),"ebeam")

You might want to do this if you want to override the automatic histogram limits, or you want to set the histogram title.