(rdf-make-scatterplots)=

# Walkthrough: Making scatterplots
**(15 minutes)**

Now that we've had some practice making one-dimensional histograms, let's 
make a two-dimensional histogram. Let's see if we can take the same approach
that we did for {ref}`Exercise 1 <detective-work>`. To make a 1-D histogram we used
`Histo1D`; when I look at the [RDataFrame web page](https://root.cern/doc/master/classROOT_1_1RDataFrame.html)
I see there's a `Histo2D` method. So it's obvious that it should be something like:

:::{code-block} c++
:name: cpp-rdf-try-2d
:caption: RDataFrame - does this make a 2-D histogram (C++)
hist2dim = dataframe.Histo2D("ebeam","px");
hist2dim->Draw();
canvas.Draw();
:::

:::{code-block} python
:name: python-rdf-try-2d
:caption: RDataFrame - does this make a 2-D histogram (Python)
hist2dim = dataframe.Histo2D("ebeam","px")
hist2dim.Draw()
canvas.Draw()
:::

Give it a try!

Hey, what's happening? Am I being sneaky again?

Not this time. This is one of those (fortunately rare) cases where ROOT is not 
uniform in its approach. To make a 2-D histogram with `RDataFrame`, you have to supply
the same parameters to `Histo2D` as if you were creating the
histogram "by hand."

If you look up [TH2D](https://root.cern/doc/master/classTH2D.html), in analogy
with [TH1D](https://root.cern/doc/master/classTH1D.html), you'll see that the
arguments to `TH2D` are something like:

    hist2d = TH2D("name","title",nxbins,xlo,xhi,nybins,ylo,yhi)

where:

   - `"name"` is the ROOT name of the histogram;
   - `"title"` is the title of the histogram, which is shown at the top of the plot;
   - `nxbins` is the number of bins on the x-axis;
   - `xlo` is the lower limit of the x-axis of the plot;
   - `xhi` is the upper limit of the x-axis of the plot;
   - `nybins` is the number of bins on the y-axis;
   - `ylo` is the lower limit of the y-axis of the plot;
   - `yhi` is the upper limit of the y-axis of the plot.
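
To make these binning parameters concrete, here is a minimal pure-Python sketch of how a value is assigned to a bin along one axis. The helper `bin_index` and the sample numbers are illustrative assumptions, not part of ROOT. It also simplifies ROOT's conventions: ROOT numbers bins starting from 1 and keeps separate underflow and overflow bins, while this sketch uses 0-based bins and returns `None` for values outside the axis range.

```python
def bin_index(value, nbins, lo, hi):
    """Return the 0-based bin holding value, or None if it lies outside [lo, hi)."""
    if value < lo or value >= hi:
        return None
    # Scale the distance from the lower edge to a bin number.
    return min(int((value - lo) * nbins / (hi - lo)), nbins - 1)


# Illustrative binning: 100 bins on each axis.
nxbins, xlo, xhi = 100, 149, 151
nybins, ylo, yhi = 100, -20, 20

# The width of one bin along each axis.
x_width = (xhi - xlo) / nxbins
y_width = (yhi - ylo) / nybins

# The (x-bin, y-bin) cell that would hold the pair (150.0, 0.0).
cell = (bin_index(150.0, nxbins, xlo, xhi), bin_index(0.0, nybins, ylo, yhi))
print(x_width, y_width, cell)
```

Every pair of values that lands in the same cell is counted together, so the number of bins sets how fine the grid of the plot is.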

When using `RDataFrame`, you have to explicitly supply these values to `Histo2D` like this:

    Histo2D(("name", "title", nxbins, xlo, xhi, nybins, ylo, yhi),"ebeam","px")

Here's how it looks in the actual code, specifying the `TH2D` parameters in a
braced initializer list in C++ and in a tuple in Python:

:::{code-block} c++
:name: cpp-rdf-2d
:caption: RDataFrame - making a 2-D histogram (C++)
hist2dim = dataframe.Histo2D({"hist2dim", "ebeam vs px", 100, 149, 151, 100, -20, 20},"ebeam","px");
hist2dim->Draw();
canvas.Draw();
:::

:::{code-block} python
:name: python-rdf-2d
:caption: RDataFrame - making a 2-D histogram (Python)
hist2dim = dataframe.Histo2D(("hist2dim", "ebeam vs px", 100, 149, 151, 100, -20, 20),"ebeam","px")
hist2dim.Draw()
canvas.Draw()
:::

Give it a try!

:::{note}
This is a scatterplot, a handy way of observing the correlations between
two variables. The `Histo2D` command interprets the last two variables as
"x","y" to define which axes to use.

It's easy to fall into the trap of thinking that each (x,y) point on a
scatterplot represents two values in your n-tuple. It doesn't. The scatterplot is a
grid; each square in the grid is randomly populated with a density of
dots proportional to the number of values in that square.
:::
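
The idea in the note above can be sketched in plain Python. This is only an illustration of the principle, under the assumption that each entry in a cell becomes one dot; the function `scatter_dots` and the tiny `counts` grid are made up for this example, and ROOT's own drawing code may differ in detail (for instance in how it scales the number of dots).

```python
import random


def scatter_dots(counts, xlo, xhi, ylo, yhi, seed=0):
    """Turn a grid of per-cell counts into randomly placed dots.

    counts[i][j] is the number of entries in x-bin i and y-bin j.
    """
    rng = random.Random(seed)
    nx, ny = len(counts), len(counts[0])
    dx, dy = (xhi - xlo) / nx, (yhi - ylo) / ny
    dots = []
    for i in range(nx):
        for j in range(ny):
            for _ in range(counts[i][j]):
                # The dot lands somewhere inside its cell,
                # not at the original (x, y) values of any entry.
                dots.append((xlo + (i + rng.random()) * dx,
                             ylo + (j + rng.random()) * dy))
    return dots


# A made-up 2x2 grid of counts covering 0 <= x < 2 and 0 <= y < 2.
counts = [[3, 0], [1, 2]]
dots = scatter_dots(counts, 0.0, 2.0, 0.0, 2.0)
print(len(dots))
```

The dots show how many entries fell into each cell, but the position of an individual dot within its cell carries no information.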

This leads to the question: How did I know the values for `xlo`, `xhi`, `ylo`, and `yhi` in the
above examples? The answer is that I made 1-D plots for the variables so I knew their range,
then used those values for the 2-D axis limits.[^why2d]$^,$[^histo1d]
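
If you'd rather compute axis limits than read them off a plot, one possible approach is to take the smallest and largest values of a variable and add a small margin. The sketch below uses a plain Python list of made-up numbers so that it runs without ROOT; the helper `axis_limits` and its `padding` parameter are assumptions for illustration. With a real dataframe, the `Min` and `Max` actions of `RDataFrame` could supply the two extremes.

```python
def axis_limits(values, padding=0.05):
    """Return (lo, hi) axis limits that enclose values with a small margin."""
    values = list(values)
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical; use a nominal span so the axis has some width.
        span = 1.0
    margin = span * padding
    return lo - margin, hi + margin


# Made-up sample values standing in for a column such as ebeam.
sample = [149.2, 150.0, 150.8]
xlo, xhi = axis_limits(sample)
print(xlo, xhi)
```

The resulting `xlo` and `xhi` would then take the place of the hand-picked limits passed to `Histo2D`.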

[^why2d]: There's another obvious question: Why is this necessary? The `Histo1D` method is able
    to automatically determine the scale of its single x-axis; why can't `Histo2D` do the same
    for its axes?

    I hunted for the reason, and finally asked the question on the
    [ROOT
    Forums](https://root-forum.cern.ch/t/using-rdataframe-histo2d/29408).
    The answer has to do with being able to use `RDataFrame`
    {ref}`with multiple threads <threads>`, a subject I address in the
    {ref}`intermediate topics <intermediate>` section. While running
    with multiple execution threads, the ROOT developers can make
    automatic scaling of `Histo1D` work, but they haven't figured out
    how to make automatic axis scaling work with `Histo2D` (or
    `Histo3D`, for that matter).

    The lesson here: Even though `RDataFrame` is generally easier to use (yes, really!)
    than the techniques described in {ref}`cpath` or {ref}`pythonpath`, there are still
    times when you have to deal with ROOT's peculiarities.

[^histo1d]: You can also explicitly specify the parameters when creating a 1-D histogram; e.g.,

         hist1 = dataframe.Histo1D(("h1", "ebeam", 100, 149, 151),"ebeam")

    You might want to do this if you want to override the automatic histogram limits, or
    you want to set the histogram title. 

Now that you have the recipe, try making scatterplots of different
pairs of variables. Do you see any correlations?

:::{note}
If you see a shapeless blob on the scatterplot, the variables are likely
to be uncorrelated; for example, plot **`px`** versus **`py`**. If you see a
pattern, there may be a correlation; for example, plot **`pz`** versus
**`zv`**. It appears that the higher **`pz`** is, the lower **`zv`** is, and
vice versa. Perhaps the particle loses energy before it is deflected in
the target.
:::
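
Judging a correlation by eye can be backed up with a number. A common choice is the Pearson correlation coefficient, which is near 0 for a shapeless blob and near +1 or -1 for a tight upward or downward trend. The following is a minimal sketch in plain Python; the function `pearson_r` and the sample lists are made up for illustration and are not values from the tutorial's n-tuple. (For binned data, ROOT's `TH2` class offers a `GetCorrelationFactor` method.)

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: as the first variable rises, the second falls.
first = [1.0, 2.0, 3.0, 4.0]
second = [8.0, 6.0, 4.0, 2.0]
print(pearson_r(first, second))
```

A coefficient close to zero does not rule out a relationship; it only says there is no straight-line trend, so it is still worth looking at the scatterplot itself.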

:::{figure-md} correlation-fig
:align: center

<img src="https://imgs.xkcd.com/comics/correlation.png" alt="xkcd correlation" width="75%">

<https://xkcd.com/552/> by Randall Munroe
:::