Abstract

This document is a pedagogical introduction to statistics for particle physics. Emphasis is placed on the terminology, concepts, and methods being used at the Large Hadron Collider. The document addresses both the statistical tests applied to a model of the data and the modeling itself. The document lives on GitHub and Authorea; the initial arXiv version is 1503.07622.

It is often said that the language of science is mathematics. It could well be said that the language of experimental science is statistics. It is through statistical concepts that we quantify the correspondence between theoretical predictions and experimental observations. While the statistical analysis of the data is often treated as a final subsidiary step to an experimental physics result, a more direct approach would be quite the opposite. In fact, thinking through the requirements for a robust statistical statement is an excellent way to organize an analysis strategy.

In these lecture notes^{1} I will devote significant attention to the strategies used in high-energy physics for developing a statistical model of the data. This modeling stage is where you inject your understanding of the physics. I like to think of the modeling stage in terms of a conversation. When your colleague asks you over lunch to explain your analysis, you tell a story. It is a story about the signal and the backgrounds – are they estimated using Monte Carlo simulations, a side-band, or some data-driven technique? Is the analysis based on counting events or do you use some discriminating variable, like an invariant mass or perhaps the output of a multivariate discriminant? What are the dominant uncertainties in the rate of signal and background events and how do you estimate them? What are the dominant uncertainties in the shape of the distributions and how do you estimate them? The answers to these questions form a *scientific narrative*; the more convincing this narrative is, the more convincing your analysis strategy is. The statistical model is the mathematical representation of this narrative, and you should strive for it to be as faithful a representation as possible.

Once you have constructed a statistical model of the data, the actual statistical procedures should be relatively straightforward. In particular, the statistical tests can be written for a generic statistical model without knowledge of the physics behind the model. The goal of the `RooStats` project was precisely to provide statistical tools based on an arbitrary statistical model implemented with the `RooFit` modeling language. While the formalism for the statistical procedures can be somewhat involved, the logical justification for the procedures is based on a number of abstract properties of those procedures. One can follow the logical argument without worrying about the detailed mathematical proofs that the procedures have the required properties. Within the last five years there has been a significant advance in the field’s understanding of certain statistical procedures, which has led to some commonalities in the statistical recommendations by the major LHC experiments. I will review some of the most common statistical procedures and their logical justification.

^{1} These notes borrow significantly from other documents that I am writing contemporaneously; specifically Ref. (G. Cowan, K. Cranmer, E. Gross, O. Vitells 2011), documentation for `HistFactory` (Cranmer), and the ATLAS Higgs combination.

This section specifies my notations and conventions, which I have chosen with some care.

Figure \ref{fig:hierarchy} establishes a hierarchy that is fairly general for the context of high-energy physics. Imagine the search for the Higgs boson, in which the search is composed of several “channels” indexed by \(c\). Here a channel is defined by its associated event selection criteria, not an underlying physical process. In addition to the number of selected events, \(n_c\), each channel may make use of some other measured quantity, \(x_c\), such as the invariant mass of the candidate Higgs boson. These quantities will be called “observables” and will be written in roman letters, e.g. \(x_c\). The notation is chosen to make manifest that the observable \(x\) is frequentist in nature. Replication of the experiment many times will result in different values of \(x\), and this ensemble gives rise to a *probability density function* (pdf) of \(x\), written \(f(x)\), which has the important property that it is normalized to unity

\[\int f(x) \;dx\;= 1\;.\]

In the case of discrete quantities, such as the number of events satisfying some event selection, the integral is replaced by a sum. Often one considers a parametric family of pdfs

\[f(x | \alpha) \;,\]

read “\(f\) of \(x\) given \(\alpha\)” and, henceforth, referred to as a *probability model* or just *model*. The parameters of the model typically represent parameters of a physical theory or an unknown property of the detector’s response. The parameters are not frequentist in nature, thus any probability statement associated with \(\alpha\) is Bayesian.
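The normalization condition and the notion of a parametric family \(f(x|\alpha)\) can be checked numerically. The sketch below is illustrative only: the Gaussian and Poisson forms, the parameter values, and the helper names (`gaussian_pdf`, `poisson_pmf`, `integrate`) are assumptions chosen for the example and are not taken from these notes. It shows that each member of the family, for each fixed value of the parameters, is separately normalized to unity over the observable, with the integral replaced by a sum in the discrete case.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Continuous model f(x | mu, sigma): a Gaussian density (illustrative choice)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def poisson_pmf(n, nu):
    """Discrete model P(n | nu): Poisson probability of n events with mean nu."""
    # Work in log space so that large n does not overflow the factorial.
    return math.exp(n * math.log(nu) - nu - math.lgamma(n + 1))


def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))


# Hypothetical parameter points alpha = (mu, sigma); each one selects a
# different member of the family f(x | alpha).
alphas = [(120.0, 2.0), (125.0, 2.0), (125.0, 5.0)]

# Integrate over x (the observable) at fixed alpha, out to +/- 10 sigma.
continuous_norms = [
    integrate(lambda x, m=m, s=s: gaussian_pdf(x, m, s), m - 10 * s, m + 10 * s)
    for m, s in alphas
]

# Discrete observable: the integral becomes a sum over event counts n.
# The sum is truncated at n = 199, far into the tail for these means.
discrete_norms = [
    sum(poisson_pmf(n, nu) for n in range(200)) for nu in (0.5, 3.0, 20.0)
]

print("continuous normalizations:", continuous_norms)
print("discrete normalizations:  ", discrete_norms)
```

Note that the normalization is always over the observable \(x\) (or \(n\)) at fixed \(\alpha\); nothing in the definition requires \(f(x|\alpha)\) to integrate to one over \(\alpha\).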