Advanced histogramming notes

The entries in this section were long footnotes in a previous edition of this tutorial. I decided to move them here to make The Basics, The C++ Path, and The Python Path less cluttered.1

Mean and StdDev, with weights

All the histograms you’ve made have a “Mean” and “StdDev” in the upper-right-hand corner of the plot. In case you need a formal definition, the mean \(\overline{x}\) and standard deviation \(\sigma\) are defined by:

(1)\[\overline{x} = \frac{1}{N}\sum_{i}^{}{x_{i}y_{i}}\]

(2)\[\sigma^{2} = \frac{1}{N}\sum_{i}^{}{y_{i}\left( x_{i} - \overline{x} \right)}^{2}\]

where \(i\) goes over the histogram bins, \(x_{i}\) is the center of the bin on the \(x\)-axis, \(y_{i}\) is the sum of the weights in that bin, and \(N\) is the sum of the weights in the histogram.

I’m being super-formal here, in that I’m referring to \(y_{i}\) and \(N\) as “sums of weights.” That’s because, as I noted when we worked with the Treeviewer, you can assign a weight to the entries as you fill a histogram. You do this by adding an extra argument to the Fill method. For example, this adds a value to a histogram with the default weight of 1.0:

hist.Fill(value)

while this adds that same value with a weight of 1/error:

hist.Fill(value, 1.0/error)

For all of the histograms in this tutorial, even in the advanced exercises, the default weight of 1.0 is sufficient. In that case, \(y_{i}\) is just the number of times you filled a value in bin \(i\) and \(N\) is the total number of times you filled the histogram.

However, there are a number of reasons why you’d want to reweight data as you binned it into a histogram. One example, which I hinted at above, is when each individual value has an error associated with it. You might want to weight the values so that those with large errors have little weight and those with small errors have big weights.

The values of Mean and StdDev in the plot do not include any values you filled that fall outside the \(x\)-axis of the histogram. This is easy to test if you have any doubts:

TH1D hist("hist","test",100,0,3)
hist.FillRandom("gaus",1000)
hist.Draw()

xkcd geothmetic_meandian — Figure 67: https://xkcd.com/2435/ by Randall Munroe

Alternate Gaussian parameterization

In ROOT’s TFormula notation, the “gaus” function refers to Equation (3), where \(A, \mu, \sigma\) refer to parameters \(P_0\), \(P_1\), \(P_2\) respectively. This is usually fine, except when you want a distribution that’s normalized so that total area under the Gaussian distribution is unity. For that, use “gausn” instead, which uses the probability distribution function in Equation (4).

With this different normalization, \(P_0\) divided by the bin width becomes the number of “equivalent events” in the histogram; that’s the “area” of the Gaussian distribution expressed in units of events with weight=1:

[] TFitResultPtr fitResult = hist.Fit("gausn")
[] Double_t numberEquivalentEvents = fitResult->Parameter(0) /
    hist.GetBinWidth(0)

Histogram arithmetic

Suppose you’re told to fill two histograms, then perform arithmetic on them; most often this will be adding histograms, but you can also subtract, multiply, and divide histogram contents. If you do this, call the “Sumw2” method of both histograms before you fill them; e.g.,

TH1* hist1 = new TH1D(...);
TH1* hist2 = new TH1D(...);
hist1->Sumw2();
hist2->Sumw2();

// Fill your histograms
hist1->Fill(...);
hist2->Fill(...);

// Add hist2 to the contents of hist1:
hist1->Add(hist2);

If you forget Sumw2, then your error bars after the arithmetic won’t be correct.

General rule: If you’re going to perform histogram arithmetic, use Sumw2 (which means “sum the squares of the weights”). Some physicists use Sumw2 all the time when they define their histograms, just in case.

Automatic histogram binning

As you’ve noticed when working with a command like tree1->Draw("zv"), ROOT can automatically determine the appropriate axis range of a plot for you. You can use the same trick. It is:

    TH1* hist = new TH1D(...); // define your histogram
    hist->SetCanExtend(TH1::kXaxis); // allow the histogram to re-bin itself
    hist->Sumw2(); // so the error bars are correct after re-binning

“Re-binning” means that if a value is supplied to the histogram that’s outside its limits, it will adjust those limits automatically. It does this by summing existing adjacent bins then doubling the bin width; the bin limits change, while the number of histogram bins remains constant.

1: “What? There was a version of this tutorial that was more cluttered?” Just for that, you get to work through all the exercises in the Advanced Exercises and Expert Exercises.