Kyle Cranmer deleted file sectionConceptual_bu.tex  about 9 years ago

Commit id: 3f2727580d01ef0f6d76a2a0d7e17d8a064cd7e2

deletions | additions      

         

\section{Conceptual building blocks for modeling}  \subsection{Probability densities and the likelihood function}  This section specifies my notations and conventions, which I have chosen with some care.%  \int f(x) \;dx\;= 1\;.  Figure~\ref{fig:hierarchy} establishes a hierarchy that is fairly general for the context of high-energy physics. Imagine the search for the Higgs boson, in which the search is composed of several ``channels'' indexed by $c$. Here a channel is defined by its associated event selection criteria, not an underlying physical process. In addition to the number of selected events, $n_c$, each channel may make use of some other measured quantity, $x_c$, such as the invariant mass of the candidate Higgs boson. The quantities will be called ``observables'' and will be written in roman letters e.g. $x_c$. The notation is chosen to make manifest that the observable $x$ is frequentist in nature. Replication of the experiment many times will result in different values of $x$ and this ensemble gives rise to a \emph{probability density function} (pdf) of $x$, written $f(x)$, which has the important property that it is normalized to unity  f(x | \alpha) \;,  In the case of discrete quantities, such as the number of events satisfying some event selection, the integral is replaced by a sum. Often one considers a parametric family of pdfs  read ``$f$ of $x$ given $\alpha$'' and, henceforth, referred to as a \emph{probability model} or just \emph{model}. The parameters of the model typically represent parameters of a physical theory or an unknown property of the detector's response. The parameters are not frequentist in nature, thus any probability statement associated with $\alpha$ is Bayesian.\footnote{Note, one can define a conditional distribution $f(x|y)$ when the joint distribution $f(x,y)$ is defined in a frequentist sense.} In order to make their lack of frequentist interpretation manifest, model parameters will be written in greek letters, e.g.: $\mu, \theta, \alpha, \nu$.%  From the full set of parameters, one is typically only interested in a few: the \emph{parameters of interest}. The remaining parameters are referred to as \emph{nuisance parameters}, as we must account for them even though we are not interested in them directly.  While $f(x)$ describes the probability density for the observable $x$ for a single event, we also need to describe the probability density for a dataset with many events, $\data = \{x_1,\dots,x_{n}\}$. If we consider the events as independently drawn from the same underlying distribution, then clearly the probability density is just a product of densities for each event. However, if we have a prediction that the total number of events expected, call it $\nu$, then we should also include the overall Poisson probability for observing $n$ events given $\nu$ expected. Thus, we arrive at what statisticians call a marked Poisson model,  \begin{equation}  \label{Eq:markedPoisson}  \F(\data|\nu,\alpha) = \Pois(n|\nu) \prod_{e=1}^n f(x_e|\alpha) \; ,  \end{equation}  where I use a bold $\F$ to distinguish it from the individual event probability density $f(x)$. In practice, the expectation is often parametrized as well and some parameters simultaneously modify the expected rate and shape, thus we can write $\nu\rightarrow\nu(\alpha)$. In \texttt{RooFit} both $f$ and $\F$ are implemented with a \texttt{RooAbsPdf}; where \texttt{RooAbsPdf::getVal(x)} always provides the value of $f(x)$ and depending on \texttt{RooAbsPdf::extendMode()} the value of $\nu$ is accessed via \texttt{RooAbsPdf::expectedEvents()}.  \cancelto{\mathrm{Not ~True!}}{\int L(\alpha) \;d\alpha = 1}\; .  The \emph{likelihood function} $L(\alpha)$ is numerically equivalent to $f(x|\alpha)$ with $x$ fixed -- or $\F(\data|\alpha)$ with \data\ fixed. The likelihood function should not be interpreted as a probability density for $\alpha$. In particular, the likelihood function does not have the property that it normalizes to unity  It is common to work with the log-likelihood (or negative log-likelihood) function. In the case of a marked Poisson, we have what is commonly referred to as an extended maximum likelihood~\cite{Barlow1990496}  \begin{eqnarray}  -\ln L( \alpha) &=& \underbrace{\nu(\alpha) - n \ln \nu(\alpha)}_{\rm extended~term} - \sum_{e=1}^n \ln f(x_e) + \underbrace{~\ln n! ~}_{\mathrm{constant}}\; .  \end{eqnarray}  To reiterate the terminology, \emph{probability density function} refers to the value of $f$ as a function of $x$ given a fixed value of $\alpha$; \emph{likelihood function} refers to the value of $f$ as a function of $\alpha$ given a fixed value of $x$; and \emph{model} refers to the full structure of $f(x|\alpha)$.  Probability models can be constructed to simultaneously describe several channels, that is several disjoint regions of the data defined by the associated selection criteria. I will use $e$ as the index over events and $c$ as the index over channels. Thus, the number of events in the $c^{\rm th}$ channel is $n_c$ and the value of the $e^{\rm th}$ event in the $c^{\rm th}$ channel is $x_{ce}$. In this context, the data is a collection of smaller datasets: \mbox{$\datasim=\{\data_1, \dots, \data_{c_{\rm max}}\}=\{\{x_{c=1,e=1}\dots x_{c=1,e=n_c}\}, \dots \{x_{c=c_{\rm max},e=1}\dots x_{c=c_{\rm max},e=n_{c_{\rm max}}} \}\}$}. In \texttt{RooFit} the index $c$ is referred to as a \texttt{RooCategory} and it is used to inside the dataset to differentiate events associated to different channels or categories. The class \texttt{RooSimultaneous} associates the dataset $\data_c$ with the corresponding marked Poisson model. The key point here is that there are now multiple Poisson terms. Thus we can write the combined (or simultaneous) model   \begin{equation}  \label{Eq:simultaneous}  \F_{\textrm{sim}}(\datasim|\alpha) = \prod_{c\in\rm channels} \left[ \Pois(n_c|\nu(\alpha)) \prod_{e=1}^{n_c} f(x_{ce}|\alpha) \right] \; ,  \end{equation}  remembering that the symbol product over channels has implications for the structure of the dataset.