\section{Physics questions formulated in statistical language}

\subsection{Measurement as parameter estimation}
\label{S:estimation}

One of the most common tasks of the working physicist is to estimate some model parameter. We do it so often that we may not even realize it. For instance, the sample mean $\bar{x} = \sum_{e=1}^n x_e / n$ is an estimate for the mean, $\mu$, of a Gaussian probability density $f(x|\mu,\sigma) = \Gauss(x|\mu,\sigma)$. More generally, an \textit{estimator} $\hat{\alpha}(\data)$ is some function of the data whose value is used to estimate the true value of some parameter $\alpha$. Estimators have various abstract properties such as variance, bias, consistency, efficiency, robustness, etc.~\cite{James}. The bias of an estimator is defined as $B(\hat\alpha) = E[ \hat\alpha ]-\alpha$, where $E$ denotes the expectation value \mbox{$E[ \hat\alpha ]=\int\hat\alpha(x) f(x)dx$}, or the probability-weighted average. Clearly one would like an unbiased estimator. The variance of an estimator is defined as $var[\hat\alpha] = E[ (\hat\alpha - E[\hat{\alpha}] )^2 ]$, and clearly one would like an estimator with the minimum variance. Unfortunately, there is a tradeoff between bias and variance. Physicists tend to be allergic to biased estimators, and within the class of unbiased estimators there is a well-defined minimum variance bound referred to as the Cram\'er-Rao bound (the inverse of the Fisher information, which we will encounter again later).

The most widely used estimator in physics is the maximum likelihood estimator (MLE). It is defined as the value of $\alpha$ which maximizes the likelihood function $L(\alpha)$. Equivalently this value, $\hat{\alpha}$, maximizes $\log L(\alpha)$ and minimizes $-\log L(\alpha)$. The most common tool for finding the maximum likelihood estimator is \texttt{Minuit}, which conventionally minimizes $-\log L(\alpha)$ (or any other function)~\cite{James:1975dr}. The jargon is that one `fits' the function and the maximum likelihood estimate is the `best fit value'.

When one has a multi-parameter likelihood function $L(\vec{\alpha})$, the situation is slightly more complicated. The maximum likelihood estimate for the full parameter list, $\hat{\vec{\alpha}}$, is clearly defined. The various components $\hat{\alpha}_p$ are referred to as the \textit{unconditional maximum likelihood estimates}. In the physics jargon, one says all the parameters are `floating'. One can also ask what the maximum likelihood estimate of $\alpha_p$ is with some other parameters $\vec{\alpha}_o$ fixed; this is called the \textit{conditional maximum likelihood estimate} and is denoted $\hat{\hat{\alpha}}_p(\vec{\alpha}_o)$. These are important quantities for defining the profile likelihood ratio, which we will discuss in more detail later. The concept of variance of the estimates is also generalized to the covariance matrix $cov[\alpha_p, \alpha_{p'}] = E[(\hat\alpha_p - \alpha_p)(\hat\alpha_{p'}- \alpha_{p'})]$, often denoted $\Sigma_{pp'}$. Note, the diagonal elements of the covariance matrix are the same as the variances of the individual parameters, i.e. $cov[\alpha_p, \alpha_{p}] = var[\alpha_p]$.

In the case of a Poisson model $\Pois(n|\nu)$ the maximum likelihood estimate of $\nu$ is simply \mbox{$\hat{\nu}=n$}. It follows that the variance of the estimator is $var[\hat{\nu}]=var[n]=\nu$. Thus, if the true rate is $\nu$, one expects to find estimates $\hat{\nu}$ with a characteristic spread around $\nu$; it is in this sense that the estimate has an uncertainty or `error' of $\sqrt{n}$. We will make this statement of uncertainty more precise when we discuss frequentist confidence intervals.
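
As a concrete illustration of the estimation logic above, here is a minimal Python sketch (using \texttt{scipy} rather than \texttt{Minuit}, on an invented toy dataset) that minimizes $-\log L$ for a Gaussian model. The resulting $\hat{\mu}$ coincides with the sample mean, while $\hat{\sigma}$ illustrates that maximum likelihood estimators are not automatically unbiased.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=100)  # toy data: true mu=3, sigma=2

def nll(params):
    """Negative log-likelihood -log L(mu, sigma) for the Gaussian model."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# 'Fit': minimize -log L starting from a rough guess for (mu, sigma)
result = minimize(nll, x0=[1.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x

print(mu_hat, np.mean(x))   # the MLE of mu coincides with the sample mean
# The MLE of sigma^2 has expectation (n-1)/n * sigma^2, i.e. it is biased,
# illustrating the bias/variance discussion above.
\end{verbatim}
In practice one would typically perform this minimization with \texttt{Minuit} (for example through its Python interface \texttt{iminuit}); the sketch only shows the structure of the calculation.
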
When the number of events is large, the distribution of maximum likelihood estimates approaches a Gaussian or normal distribution.\footnote{There are various conditions that must be met for this to be true, but we will skip the fine print in these lectures. There are two conditions that are most often violated in particle physics, which will be addressed later.} This does not depend on the pdf $f(x)$ having a Gaussian form. For small samples this is not the case, but this limiting distribution is often referred to as an \textit{asymptotic distribution}. Furthermore, under most circumstances in particle physics, the variance of the maximum likelihood estimate approaches the minimum variance or Cram\'er-Rao bound. In particular, the inverse of the covariance matrix for the estimates is asymptotically given by
\begin{equation}
\label{Eq:expfisher}
\Sigma_{pp'}^{-1}(\vec\alpha) = E\left[- \frac{\partial^2 \log f(x|\vec{\alpha})}{\partial\alpha_p \,\partial\alpha_{p'}} \;\middle|\; \vec\alpha \right ] \;,
\end{equation}
where I have written explicitly that the expectation, and thus the covariance matrix itself, depend on the true value $\vec\alpha$. The right side of Eq.~\ref{Eq:expfisher} is called the (expected) Fisher information matrix. Remember that the expectation involves an integral over the observables. Since that integral is difficult to perform in general, one often uses the observed Fisher information matrix to approximate the variance of the estimator, by simply taking the matrix of second derivatives based on the observed data
\begin{equation}
\label{Eq:obsfisher}
\tilde\Sigma_{pp'}^{-1}(\vec\alpha) = - \frac{\partial^2 \log L(\vec{\alpha})}{\partial\alpha_p \,\partial\alpha_{p'}} \; .
\end{equation}
This is what \texttt{Minuit}'s \texttt{Hesse} algorithm\footnote{The matrix is called the Hessian, hence the name.} calculates to estimate the covariance matrix of the parameters.
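
As a rough numerical illustration of Eq.~\ref{Eq:obsfisher} (this is not \texttt{Minuit}'s actual \texttt{Hesse} implementation, and the helper \texttt{second\_derivative} is purely illustrative), the following sketch evaluates the observed Fisher information for the Gaussian toy model above by finite differences and inverts it to estimate the covariance matrix of $(\hat{\mu},\hat{\sigma})$.
\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=100)   # same toy dataset as above

def nll(mu, sigma):
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

def second_derivative(f, theta, i, j, eps=1e-4):
    """Central finite-difference estimate of d^2 f / (dtheta_i dtheta_j)."""
    def shifted(si, sj):
        t = np.array(theta, dtype=float)
        t[i] += si * eps
        t[j] += sj * eps
        return f(*t)
    return (shifted(+1, +1) - shifted(+1, -1)
            - shifted(-1, +1) + shifted(-1, -1)) / (4 * eps**2)

# Observed Fisher information: second derivatives of -log L at the MLE
theta_hat = [np.mean(x), np.std(x)]            # analytic MLEs for the Gaussian
info = np.array([[second_derivative(nll, theta_hat, i, j) for j in range(2)]
                 for i in range(2)])

cov = np.linalg.inv(info)                      # approximate covariance matrix
print(np.sqrt(np.diag(cov)))                   # compare to sigma/sqrt(n) for mu_hat
\end{verbatim}
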
\subsection{Discovery as hypothesis tests}
\label{S:hypothesis test}

Let us examine the statistical statement associated with the claim of discovery of new physics. Typically, new physics searches are looking for a signal that is additive on top of the background, though in some cases there are interference effects that need to be taken into account and one cannot really talk about `signal' and `background' in any meaningful way. Discovery is formulated in terms of a hypothesis test where the background-only hypothesis plays the role of the null hypothesis and the signal-plus-background hypothesis plays the role of the alternative. Roughly speaking, the claim of discovery is a statement that the data are incompatible with the background-only hypothesis. Consider the simplest scenario where one is counting events in the signal region, $n_{\rm SR}$, and expects $\nu_B$ events from background and $\nu_S$ events from the putative signal. Then we have the following hypotheses:
\begin{center}
\begin{tabular}{llll}
symbol & statistical name & physics name & probability model \\ \hline
$H_0$ & null hypothesis & background-only & $\Pois(n_{SR} | \nu_B)$ \\
$H_1$ & alternate hypothesis & signal-plus-background & $\Pois(n_{SR} | \nu_S+\nu_B)$
\end{tabular}
\end{center}
In this simple example it is fairly obvious that evidence for a signal shows up as an excess of events, and a reasonable way to quantify the compatibility of the observed data $n_{SR}^0$ and the null hypothesis is to calculate the probability that the background-only hypothesis would produce at least this many events; the $p$-value
\begin{equation}
p = \sum_{n=n_{SR}^0}^\infty \Pois(n | \nu_B) \; .
\end{equation}
If this $p$-value is very small, then one might choose to reject the null hypothesis. Note, the $p$-value is \textit{not} to be interpreted as the probability of the null hypothesis given the data -- that is a manifestly Bayesian statement. Instead, the $p$-value is a statement about the probability to have obtained data with a certain property assuming the null hypothesis.

How do we generalize this to more complicated situations? There were really two ingredients in our simple example. The first was the proposal that we would reject the null hypothesis based on the probability for it to produce data at least as extreme as the observed data. The second ingredient was the prescription for what is meant by `more discrepant'; in this case the possible observations are ordered according to increasing $n_{SR}$. One could imagine using the difference between observed and expected, $n_{SR}-\nu_B$, as the measure of discrepancy. In general, a function that maps the data to a single real number is called a \textit{test statistic}, $T(\data)\in\mathbb{R}$. How does one choose from the infinite number of possible test statistics?

Neyman and Pearson provided a framework for hypothesis testing that addresses the choice of the test statistic. This setup treats the null and the alternate hypotheses in an asymmetric way. First, one defines an \textit{acceptance region} in terms of a test statistic, such that if $T(\data)< k_\alpha$ one accepts the null hypothesis. One can think of $T(\data) = k_\alpha$ as defining a contour in the space of the data, which is the boundary of this acceptance region. Next, one defines the \textit{size of the test}, $\alpha$,\footnote{Note, $\alpha$ is the conventional notation for the size of the test, and has nothing to do with a model parameter in Eq.~\ref{Eq:simultaneous}.} as the probability that the null hypothesis will be rejected when it is true (a so-called Type-I error). This is equivalent to the probability under the null hypothesis that the data will not be found in the acceptance region, i.e. $\alpha = P(T(\data) \ge k_\alpha | H_0)$. Note, it is now clear why there is a subscript on $k_\alpha$: the contour level is related to the size of the test. In contrast, if one accepts the null hypothesis when the alternate is true, it is called a Type-II error. The probability to commit a Type-II error is denoted $\beta$ and is given by $\beta=P(T(\data) < k_\alpha|H_1)$. One calls $1-\beta$ the \textit{power} of the test. With these definitions in place, one looks for a test statistic that maximizes the power of the test for a fixed test size. This is a problem for the calculus of variations, and it sounds like it might be very difficult for complicated probability models.
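
Before addressing the choice of test statistic, the following sketch makes the counting-experiment $p$-value and the size and power definitions concrete; the values of $\nu_B$, $\nu_S$, and the observed count are invented for illustration.
\begin{verbatim}
from scipy.stats import poisson

nu_B, nu_S = 50.0, 15.0   # hypothetical background and signal expectations
n_obs = 68                # hypothetical observed count in the signal region

# p-value: probability under the background-only hypothesis to observe
# at least n_obs events, p = sum_{n >= n_obs} Pois(n | nu_B)
p_value = poisson.sf(n_obs - 1, nu_B)

# Size and power of the test "reject H0 when n_SR >= k_cut"
k_cut = int(poisson.ppf(1 - 0.05, nu_B)) + 1   # largest rejection region with size <= 0.05
alpha = poisson.sf(k_cut - 1, nu_B)            # actual size (below 0.05: n is discrete)
power = poisson.sf(k_cut - 1, nu_B + nu_S)     # 1 - beta under signal-plus-background

print(p_value, k_cut, alpha, power)
\end{verbatim}
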
It turns out that in the case of two simple hypotheses (probability models without any free parameters), there is a simple solution! In particular, the test statistic leading to the most powerful test is given by the likelihood ratio $T_{NP}(\data) = f(\data|H_1)/f(\data|H_0)$. This result is referred to as the Neyman-Pearson lemma, and I will give an informal proof by considering a small variation to the acceptance region defined by the likelihood ratio. The solid red contour in Fig.~\ref{fig:neymanpearson} represents the rejection region (the complement of the acceptance region) based on the likelihood ratio, and the dashed blue contour represents a small perturbation. If we can show that any variation of the likelihood-ratio contour has less power, then we will have proved the Neyman-Pearson lemma. The variation adds (the left, blue wedge) and removes (the right, red wedge) rejection regions. Because the Neyman-Pearson setup requires that both tests have the same size, the probability for the data to be found in the two wedges must be the same under the null hypothesis. Because the two regions are on opposite sides of the contour defined by $f(\data|H_1)/f(\data|H_0)$, the data are less likely to be found in the small region that was added than in the small region that was removed, assuming the alternate hypothesis. In other words, there is less probability to reject the null when the alternate is true; thus the test based on the new contour is less powerful.
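
To see the lemma at work, the following Monte Carlo sketch compares, at (approximately) fixed size, the power of the likelihood-ratio statistic to that of the naive total-count statistic in a hypothetical two-bin counting experiment; the bin expectations are invented for illustration, and the size is only approximate because the statistics are discrete.
\begin{verbatim}
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
b = np.array([10.0, 50.0])   # hypothetical background expectations per bin
s = np.array([5.0, 1.0])     # hypothetical signal expectations per bin

def log_lr(n):
    """Log of the Neyman-Pearson statistic Pois(n|s+b)/Pois(n|b), summed over bins."""
    return np.sum(poisson.logpmf(n, s + b) - poisson.logpmf(n, b), axis=-1)

# Toy experiments under the null (background-only) and alternate (signal-plus-background)
n0 = rng.poisson(b, size=(200000, 2))
n1 = rng.poisson(s + b, size=(200000, 2))

for name, T in [("total count", lambda n: n.sum(axis=-1)),
                ("likelihood ratio", log_lr)]:
    t0, t1 = T(n0), T(n1)
    k = np.quantile(t0, 0.95)    # threshold for a test of size alpha ~ 0.05
    power = np.mean(t1 > k)      # 1 - beta: probability to reject H0 under H1
    print(name, power)
\end{verbatim}
The likelihood ratio weights the low-background bin more heavily than the total count does, and correspondingly achieves higher power at the same size.
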