
# Practical Statistics for the LHC

Abstract

This document is a pedagogical introduction to statistics for particle physics. Emphasis is placed on the terminology, concepts, and methods being used at the Large Hadron Collider. The document addresses both the statistical tests applied to a model of the data and the modeling itself. The document lives on GitHub and Authorea; the initial arXiv version is 1503.07622.

# Introduction

It is often said that the language of science is mathematics. It could well be said that the language of experimental science is statistics. It is through statistical concepts that we quantify the correspondence between theoretical predictions and experimental observations. While the statistical analysis of the data is often treated as a final subsidiary step to an experimental physics result, a more direct approach would be quite the opposite. In fact, thinking through the requirements for a robust statistical statement is an excellent way to organize an analysis strategy.

In these lecture notes¹ I will devote significant attention to the strategies used in high-energy physics for developing a statistical model of the data. This modeling stage is where you inject your understanding of the physics. I like to think of the modeling stage in terms of a conversation. When your colleague asks you over lunch to explain your analysis, you tell a story. It is a story about the signal and the backgrounds – are they estimated using Monte Carlo simulations, a side-band, or some data-driven technique? Is the analysis based on counting events or do you use some discriminating variable, like an invariant mass or perhaps the output of a multivariate discriminant? What are the dominant uncertainties in the rate of signal and background events and how do you estimate them? What are the dominant uncertainties in the shape of the distributions and how do you estimate them? The answer to these questions forms a scientific narrative; the more convincing this narrative is, the more convincing your analysis strategy is. The statistical model is the mathematical representation of this narrative and you should strive for it to be as faithful a representation as possible.

Once you have constructed a statistical model of the data, the actual statistical procedures should be relatively straightforward. In particular, the statistical tests can be written for a generic statistical model without knowledge of the physics behind the model. The goal of the RooStats project was precisely to provide statistical tools based on an arbitrary statistical model implemented with the RooFit modeling language. While the formalism for the statistical procedures can be somewhat involved, the logical justification for the procedures is based on a number of abstract properties of those procedures. One can follow the logical argument without worrying about the detailed mathematical proofs that the procedures have the required properties. Within the last five years there has been a significant advance in the field’s understanding of certain statistical procedures, which has led to some commonalities in the statistical recommendations by the major LHC experiments. I will review some of the most common statistical procedures and their logical justification.

1. These notes borrow significantly from other documents that I am writing contemporaneously; specifically Ref. (G. Cowan, K. Cranmer, E. Gross, O. Vitells 2011), documentation for RooAbsPdf::getVal(x) (Cranmer), and the ATLAS Higgs combination.

# Conceptual building blocks for modeling

## Probability densities and the likelihood function

This section specifies my notations and conventions, which I have chosen with some care.

Figure \ref{fig:hierarchy} establishes a hierarchy that is fairly general for the context of high-energy physics. Imagine the search for the Higgs boson, in which the search is composed of several “channels” indexed by $$c$$. Here a channel is defined by its associated event selection criteria, not an underlying physical process. In addition to the number of selected events, $$n_c$$, each channel may make use of some other measured quantity, $$x_c$$, such as the invariant mass of the candidate Higgs boson. These quantities will be called “observables” and will be written in roman letters, e.g. $$x_c$$. The notation is chosen to make manifest that the observable $$x$$ is frequentist in nature. Replication of the experiment many times will result in different values of $$x$$, and this ensemble gives rise to a probability density function (pdf) of $$x$$, written $$f(x)$$, which has the important property that it is normalized to unity

$\int f(x) \;dx\;= 1\;.$

In the case of discrete quantities, such as the number of events satisfying some event selection, the integral is replaced by a sum. Often one considers a parametric family of pdfs

$f(x | \alpha) \;,$

read “$$f$$ of $$x$$ given $$\alpha$$” and, henceforth, referred to as a probability model or just model. The parameters of the model typically represent parameters of a physical theory or an unknown property of the detector’s response. The parameters are not frequentist in nature, thus any probability statement associated with $$\alpha$$ is Bayesian.¹ In order to make their lack of frequentist interpretation manifest, model parameters will be written in greek letters, e.g. $$\mu, \theta, \alpha, \nu$$. From the full set of parameters, one is typically only interested in a few: the parameters of interest. The remaining parameters are referred to as nuisance parameters, as we must account for them even though we are not interested in them directly.
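The normalization property and the idea of a parametric family can be checked numerically. The following is a minimal plain-Python sketch (not RooFit); the unit-width Gaussian and its mean are hypothetical stand-ins for a real model’s pdf $$f(x|\alpha)$$ and parameter $$\alpha$$:

```python
import math

def f(x, alpha, sigma=1.0):
    # Hypothetical parametric family f(x | alpha): a Gaussian whose mean
    # plays the role of the model parameter alpha.
    return math.exp(-0.5 * ((x - alpha) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Check the defining property, integral of f(x|alpha) dx = 1,
# with a simple Riemann sum over a range wide enough to cover the tails.
dx = 0.001
integral = sum(f(i * dx, alpha=0.5) * dx for i in range(-10_000, 10_000))
```

Changing `alpha` moves the density around but leaves the integral at unity, which is exactly what distinguishes the observable $$x$$ (integrated over) from the parameter $$\alpha$$ (held fixed).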

While $$f(x)$$ describes the probability density for the observable $$x$$ for a single event, we also need to describe the probability density for a dataset with many events, $${{{\mathcal{D}}}}= \{x_1,\dots,x_{n}\}$$. If we consider the events as independently drawn from the same underlying distribution, then clearly the probability density is just a product of densities for each event. However, if we have a prediction for the total number of events expected, call it $$\nu$$, then we should also include the overall Poisson probability for observing $$n$$ events given $$\nu$$ expected. Thus, we arrive at what statisticians call a marked Poisson model, $\label{eqn:markedPoisson} {{{\mathbf{f}}}}({{{\mathcal{D}}}}|\nu,\alpha) = {{{\rm Pois}}}(n|\nu) \prod_{e=1}^n f(x_e|\alpha) \; ,$ where I use a bold $${{{\mathbf{f}}}}$$ to distinguish it from the individual event probability density $$f(x)$$. In practice, the expectation is often parametrized as well and some parameters simultaneously modify the expected rate and shape, thus we can write $$\nu\rightarrow\nu(\alpha)$$. In RooFit both $$f$$ and $${{{\mathbf{f}}}}$$ are implemented with a RooAbsPdf; RooAbsPdf::getVal(x) always provides the value of $$f(x)$$ and, depending on RooAbsPdf::extendMode(), the value of $$\nu$$ is accessed via RooAbsPdf::expectedEvents().
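The structure of the marked Poisson model is easy to spell out in code. This is a toy plain-Python sketch, not the RooFit implementation; the Gaussian event density, the dataset values, and the choice of $$\nu$$ and $$\alpha$$ are all hypothetical:

```python
import math

def gauss(x, mu, sigma=1.0):
    # Hypothetical per-event density f(x | alpha); alpha is taken to be the mean.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def poisson(n, nu):
    # Pois(n | nu) = nu^n exp(-nu) / n!
    return nu ** n * math.exp(-nu) / math.factorial(n)

def marked_poisson(dataset, nu, alpha):
    """f(D | nu, alpha) = Pois(n | nu) * prod_e f(x_e | alpha)."""
    like = poisson(len(dataset), nu)
    for x in dataset:
        like *= gauss(x, alpha)
    return like

# A toy dataset D = {x_1, ..., x_n} with n = 3 observed events.
data = [0.8, 1.2, 1.0]
likelihood = marked_poisson(data, nu=3.0, alpha=1.0)
```

Note that for an empty dataset the product over events is 1 and only the Poisson term $${\rm Pois}(0|\nu) = e^{-\nu}$$ survives, which is a useful sanity check on the structure.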

The likelihood function $$L(\alpha)$$ is numerically equivalent to $$f(x|\alpha)$$ with $$x$$ fixed – or $${{{\mathbf{f}}}}({{{\mathcal{D}}}}|\alpha)$$ with $${{{\mathcal{D}}}}$$ fixed. The likelihood function should not be interpreted as a probability density for $$\alpha$$. In particular, the likelihood function does not have the property that it normalizes to unity

${\int L(\alpha) \;d\alpha ~~\mathtt{ Need~Not~Equal}~~ 1}\; .$

It is common to work with the log-likelihood (or negative log-likelihood) function. In the case of a marked Poisson, we have what is commonly referred to as an extended maximum likelihood (Barlow 1990) \begin{aligned} -\ln L( \alpha) &= \underbrace{\nu(\alpha) - n \ln \nu(\alpha)}_{\rm extended~term} - \sum_{e=1}^n \ln f(x_e|\alpha) + \underbrace{~\ln n! ~}_{\mathrm{constant}}\; .\end{aligned} To reiterate the terminology, probability density function refers to the value of $$f$$ as a function of $$x$$ given a fixed value of $$\alpha$$; likelihood function refers to the value of $$f$$ as a function of $$\alpha$$ given a fixed value of $$x$$; and model refers to the full structure of $$f(x|\alpha)$$.
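The algebra behind the extended form is just $$-\ln {\rm Pois}(n|\nu) = \nu - n\ln\nu + \ln n!$$ added to the sum of per-event log densities, and the equivalence can be verified numerically. A small Python sketch with a hypothetical unit-width Gaussian event density:

```python
import math

def log_f(x, alpha):
    # Log of a hypothetical unit-width Gaussian event density f(x | alpha).
    return -0.5 * (x - alpha) ** 2 - 0.5 * math.log(2.0 * math.pi)

def nll_direct(dataset, nu, alpha):
    # -ln[ Pois(n | nu) * prod_e f(x_e | alpha) ], evaluated term by term.
    n = len(dataset)
    log_pois = n * math.log(nu) - nu - math.log(math.factorial(n))
    return -log_pois - sum(log_f(x, alpha) for x in dataset)

def nll_extended(dataset, nu, alpha):
    # nu - n ln(nu) - sum_e ln f(x_e | alpha) + ln(n!), as in the text.
    n = len(dataset)
    return (nu - n * math.log(nu)
            - sum(log_f(x, alpha) for x in dataset)
            + math.log(math.factorial(n)))

data = [0.8, 1.2, 1.0]  # toy dataset
```

Since the $$\ln n!$$ term does not depend on $$\alpha$$ or $$\nu$$, it is constant for a fixed dataset and is often dropped in minimization.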

Probability models can be constructed to simultaneously describe several channels, that is, several disjoint regions of the data defined by the associated selection criteria. I will use $$e$$ as the index over events and $$c$$ as the index over channels. Thus, the number of events in the $$c^{\rm th}$$ channel is $$n_c$$ and the value of the $$e^{\rm th}$$ event in the $$c^{\rm th}$$ channel is $$x_{ce}$$. In this context, the data is a collection of smaller datasets: $${{{\mathcal{D}_{\textrm{sim}}}}}= \{{{{\mathcal{D}}}}_1,\dots,{{{\mathcal{D}}}}_{c_{\rm max}}\}$$. In RooFit the index $$c$$ is referred to as a RooCategory and it is used inside the dataset to differentiate events associated to different channels or categories. The class RooSimultaneous associates the dataset $${{{\mathcal{D}}}}_c$$ with the corresponding marked Poisson model. The key point here is that there are now multiple Poisson terms. Thus we can write the combined (or simultaneous) model $\label{eqn:simultaneous} {{{\mathbf{f}}}}_{\textrm{sim}}({{{\mathcal{D}_{\textrm{sim}}}}}|\alpha) = \prod_{c\in\rm channels} \left[ {{{\rm Pois}}}(n_c|\nu_c(\alpha)) \prod_{e=1}^{n_c} f(x_{ce}|\alpha) \right] \; ,$ remembering that the product over channels has implications for the structure of the dataset.
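The simultaneous model is just a product of per-channel marked Poisson terms, with the dataset organized as a collection of per-channel datasets. A toy plain-Python sketch (the Gaussian density, the per-channel yields, and the data values are hypothetical; here the list index plays the role that a RooCategory plays inside a RooFit dataset):

```python
import math

def gauss(x, mu):
    # Hypothetical unit-width Gaussian event density f(x_ce | alpha).
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def channel_likelihood(data_c, nu_c, alpha):
    # Pois(n_c | nu_c) * prod_e f(x_ce | alpha) for a single channel c.
    n = len(data_c)
    like = nu_c ** n * math.exp(-nu_c) / math.factorial(n)
    for x in data_c:
        like *= gauss(x, alpha)
    return like

def simultaneous_likelihood(data_sim, nus, alpha):
    """f_sim(D_sim | alpha): product of per-channel marked Poisson models."""
    like = 1.0
    for data_c, nu_c in zip(data_sim, nus):
        like *= channel_likelihood(data_c, nu_c, alpha)
    return like

# D_sim as a collection of per-channel datasets, here c_max = 2 channels.
data_sim = [[0.9, 1.1], [1.0, 1.3, 0.7]]
nus = [2.0, 3.5]   # hypothetical expected yields nu_c(alpha), held fixed here
L = simultaneous_likelihood(data_sim, nus, alpha=1.0)
```

The shared `alpha` across channels is what makes the combination non-trivial: a parameter constrained by one channel also affects the prediction in the others.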

1. Note, one can define a conditional distribution $$f(x|y)$$ when the joint distribution $$f(x,y)$$ is defined in a frequentist sense.