\section{Bias, Variance, Generalization and Model Complexity}

\label{section-biasVariance}

In supervised learning settings, the goal is to build a learner with low predictive error. The algorithm’s generalization power then rests on its ability to correctly label samples from independent test data which it has not trained on. Measuring and comparing this ability among models is how the best supervised estimator is selected. The generalization error is also known as the prediction error: it is the model’s expected error over an independent test set \(\mathrm{Ts}\). In the literature, authors such as \cite{james-biasVarianceGeneral} point to two sources responsible for this error, namely the bias and the variance of the learner, and improving one generally worsens the other. These errors are central to the prediction error because they point to different weak points in the algorithm. This results in different ways to tackle each error, and controlling both is central to any supervised learning problem.

Conceptually, the first type of error relates to the model’s accuracy in labeling predictions correctly, either by assigning a sample to its correct class in a classification setting or by correctly predicting the sample’s target value in a regression setting. It is lower when the model correctly learns the underlying structure of the data. However, as models learn more from the available data, i.e. the training set, they lose the ability to extend this predictive accuracy to new samples because they overfit on the samples available. This situation manifests as another type of error, known as an increase in variance.

In the generalization error, the loss function determines how good the model’s predictions are by quantifying these errors. Given a training set \(\mathrm{T} = (X,Y)\), let \(\hat{f}(X)\) be the model’s predictive function for the features and denote by \(L(Y, \hat{f}(X))\) the loss function acting on the data.

By comparing, for each sample, the predicted target versus its real value, loss functions give a measure of how strong a classifier’s predictions are. They can be characterized as semimetrics on the target space, i.e. functions that are symmetric, non-negative and equal to zero only if both arguments are equal.

Prediction Error: \[PE(\hat{f})= {{\mathbb{E}}}_{X,Y} \left[ L(Y,\hat{f}(X))\right]\]

This is the expected error between our model and the target labels, as quantified by the loss function.

Given that any model is always built on a dataset, we need to estimate the prediction error using the model’s error over the data. This value is the generalization error.

Generalization Error:\footnote{Note that in this definition the expectation is taken conditional on the training set \(\mathrm{T}\) from which the model is fit.} \[Err_{\mathrm{T}} = {{\mathbb{E}}}_{X,Y} \left[ L(Y,\hat{f}(X)) \mid \mathrm{T}\right]\]

The generalization error is also referred to as the training prediction error. The prediction error is closely related to the generalization error, being its expectation over training sets: \[PE = {{\mathbb{E}}}_{X,Y} \left[ L(Y,\hat{f}(X))\right] = {{\mathbb{E}}}_{\mathrm{T}}\left[ Err_{\mathrm{T}} \right].\]
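As a concrete illustration, the following sketch approximates both quantities by Monte Carlo on a synthetic regression task. All names and the setup (\(f(x)=\sin x\), \(\sigma = 0.3\), a linear estimator) are illustrative assumptions, not part of the cited text:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def draw(n):
    # synthetic truth: Y = sin(X) + eps, eps ~ N(0, 0.3^2)
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + rng.normal(0, 0.3, n)

def fit_line(x, y):
    # a simple least-squares line as the estimator f_hat
    a, b = np.polyfit(x, y, 1)
    return lambda t: a * t + b

def gen_error(f_hat, n_test=100_000):
    # Err_T: mean squared loss over an independent test sample, given T
    x, y = draw(n_test)
    return np.mean((y - f_hat(x)) ** 2)

# PE = E_T[Err_T]: average the generalization error over many training sets
errs = [gen_error(fit_line(*draw(50))) for _ in range(200)]
print(f"PE ~ {np.mean(errs):.3f}")
\end{verbatim}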

Historically, the errors due to bias and variance were associated directly with the squared loss function, defined as \(L(z,w) = \left\Vert z-w \right\Vert^2_2\). This function was favored over other functions because of its analytical advantages. In this case, the prediction error is

\[\label{squaredPE} PE = {{\mathbb{E}}}_X {{\mathbb{E}}}_{Y|X} \left[ \left\Vert Y - \hat{f}(X) \right\Vert_2^2 \right]\]

It is not difficult to see that the model minimizing this error is \(\hat{f}(x) = {{\mathbb{E}}}\left[ Y \mid X=x \right]\). Suppose now that there exists a true functional relation \(Y = f(X) + \epsilon\). In this simple scenario, we take \(\epsilon\) to be noise with mean \(0\) and fixed variance \(\sigma^2\). Our model \(\hat{f}\) is an estimate of this true relation, constructed from the data.
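To see why the conditional expectation is the minimizer, fix \(X = x\), write \(c\) for a candidate prediction and expand around \({{\mathbb{E}}}\left[ Y \mid X \right]\):

\[{{\mathbb{E}}}_{Y|X}\left[ \left\Vert Y - c \right\Vert_2^2 \right] = {{\mathbb{E}}}_{Y|X}\left[ \left\Vert Y - {{\mathbb{E}}}\left[ Y \mid X \right] \right\Vert_2^2 \right] + \left\Vert {{\mathbb{E}}}\left[ Y \mid X \right] - c \right\Vert_2^2,\]

since the cross term \(2\,{{\mathbb{E}}}_{Y|X}\left[ (Y - {{\mathbb{E}}}\left[ Y \mid X \right])^\top ({{\mathbb{E}}}\left[ Y \mid X \right] - c) \right]\) vanishes. The right-hand side is minimized exactly at \(c = {{\mathbb{E}}}\left[ Y \mid X=x \right]\).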

With the squared loss, the prediction error \({{\mathbb{E}}}\left[ \left\Vert Y - \hat{f}(X) \right\Vert_2^2 \right]\) of this model can be decomposed in the following way:

\[\label{squaredBiasDecomposition} \begin{split} PE( \hat{f} ) = & {{\mathbb{E}}}_X \left[ \left( f(X) - {{\mathbb{E}}}\left[ \hat{f}(X) \right] \right)^2 \right] + {{\mathbb{E}}}_X \left[ {{\mathbb{E}}}\left[ \hat{f}(X)^2 \right] - {{\mathbb{E}}}\left[ \hat{f}(X) \right]^2 \right] + \sigma^2 \\ = & Bias(\hat{f})^2 + Var(\hat{f}) + \sigma^2, \end{split}\] where the inner expectations are taken over the training sets from which \(\hat{f}\) is built.

Equation \ref{squaredBiasDecomposition} is referred to as the bias-variance decomposition for the squared loss. The first term is the square of the bias of the estimator: it measures how far our estimator’s predictions are, on average, from the true relational function. The second term is the variance of the estimator, which measures how this random variable varies around its expected, non-random value. The noise variance \(\sigma^2\) is the part of the prediction error that is irreducible; note that we have already taken the expectation over the target, which is why we are left with the target’s noise. This part of the error cannot be minimized or controlled by our learner, as it is due to the random nature of the problem.
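A minimal numerical check of Equation \ref{squaredBiasDecomposition} at a single query point, under the same illustrative setup as above (linear estimator, \(f(x)=\sin x\), \(\sigma = 0.3\); all choices are assumptions made for the sketch):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
sigma, x0, f = 0.3, 1.0, np.sin   # noise level, query point, true relation

preds = []
for _ in range(5000):             # many independent training sets
    x = rng.uniform(-3, 3, 50)
    y = f(x) + rng.normal(0, sigma, 50)
    a, b = np.polyfit(x, y, 1)    # linear estimator f_hat
    preds.append(a * x0 + b)
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
# direct PE at x0: squared loss against fresh noisy targets
pe = np.mean((f(x0) + rng.normal(0, sigma, preds.size) - preds) ** 2)
print(f"bias^2 + var + sigma^2 = {bias2 + var + sigma ** 2:.4f}")
print(f"direct PE estimate     = {pe:.4f}")
\end{verbatim}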

For loss functions more general than the squared error, a generalized notion of bias and variance is presented in \cite{james-biasVarianceGeneral}. The author defines the properties that the bias and variance of any estimator should display, and for this we first need to define the systematic value of a random variable with respect to a loss function.

Systematic Value: Given a training set \(\mathrm{T}\), a loss function \(L(\cdot,\cdot)\) and a random variable \(Z\), the systematic value or systematic component of this variable is \[SZ = \arg\min_u {{\mathbb{E}}}_{Z} \left[ L(u,Z) \right]\]

This definition implicitly assumes that the necessary finiteness conditions on the expectation of the loss hold. The systematic value is then the constant nearest to the random variable, with distance as measured by the loss function.
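For instance, under the squared loss the systematic value of \(Z\) is its mean, while under the absolute loss it is a median. The hypothetical sketch below recovers both by direct minimization over a grid of candidate constants \(u\):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
z = rng.exponential(1.0, 10_000)       # a skewed random variable Z
grid = np.linspace(0, 5, 501)          # candidate constants u

# SZ = argmin_u E[L(u, Z)] under two different losses
sq = [np.mean((u - z) ** 2) for u in grid]    # squared loss
ab = [np.mean(np.abs(u - z)) for u in grid]   # absolute loss

print(f"squared loss:  SZ ~ {grid[np.argmin(sq)]:.2f}, "
      f"mean(Z) = {z.mean():.2f}")
print(f"absolute loss: SZ ~ {grid[np.argmin(ab)]:.2f}, "
      f"median(Z) = {np.median(z):.2f}")
\end{verbatim}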

In a general setting, the author argues that the bias should measure the systematic difference between a random variable and a particular target value, as well as how much this systematic difference contributes to the error.

On the other hand, the variance of an estimator should measure the spread of the estimator around its systematic component, as well as the effect this variability has on the prediction. In this sense, the variance and bias of the learner are defined to be:

\[\begin{split} & Bias(\hat{f})^2 = L(S\hat{f},SY) \\ & Var(\hat{f}) = {{\mathbb{E}}}_{\hat{f}} \left[ L(\hat{f} , S\hat{f}) \right] \end{split}\]

Note that with these definitions the variance is an operator defined only on the estimator, null only when the predictor is a constant value for any training set; as such, it is unchanged by new data. The bias, on the other hand, is an operator built from the systematic values of \(Y\) and \(\hat{f}\) and is null only if both of these values are equal.

The generalized bias-variance decomposition now takes the following form, where the bias and variance of the squared loss are replaced with more general quantities, the systematic effect and the variance effect:

\[\begin{split} {{\mathbb{E}}}_{X,Y} \left[ L(\hat{f}(X),Y )\right] = & {{\mathbb{E}}}_{Y} \left[ L(SY,Y )\right] + \\ & {{\mathbb{E}}}_{Y} \left[ L(S\hat{f}(X),Y ) - L(SY,Y )\right] + \\ & {{\mathbb{E}}}_{Y,X} \left[ L(\hat{f}(X),Y ) - L(S\hat{f}(X),Y )\right] \end{split}\]

Note here that the first term is the variability of the target variable with respect to its systematic value. The second term is the systematic effect, that is, the expected bias between the systematic values of \(Y\) and \(\hat{f}(X)\). The last term is the variance effect, which is the expected variability between \(Y\) and \(\hat{f}(X)\).

For the case of supervised classifiers, loss functions are also called metrics; they originate from the Type I and Type II errors common in statistics but take on different meanings in this context. There are a number of variations, and different problem settings may call for different, more suitable metrics. Here \(Y\) takes values in the class set \({\mathcal{G}}\) and we denote by \(K = |{\mathcal{G}}|\) the number of classes. Since we are trying to correctly label samples, the most common loss function in this context is \(L(Y, \hat{f}(X)) = I(Y \neq \hat{f}(X))\). Here the systematic values are \(SY = \arg\max_{1\leq i \leq K} P(Y=i)\) and \(S\hat{f}(X) = \arg\max_{1\leq i \leq K} P(\hat{f}(X)=i)\). It is also straightforward to see that \(Var(Y) = 1 - P(Y=SY)\) and \(Var(\hat{f}(X)) = 1 - P(\hat{f}(X) = S\hat{f}(X))\).

With the above, the decomposition of the prediction error among the variance, the systematic effect and the variance effect becomes \[Var(Y) + P(Y=SY) - \sum_{i=1}^K P(\hat{f}(X) =i)P(Y=i),\] which again relies heavily on the variability of the target and on how well the classifier approximates the labels.
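Under the 0-1 loss these quantities reduce to simple computations on the class probabilities. A small sketch with made-up distributions for \(Y\) and \(\hat{f}(X)\) (assumed independent), verifying that the three terms sum to \(P(Y \neq \hat{f}(X))\):

\begin{verbatim}
import numpy as np

p_y = np.array([0.5, 0.3, 0.2])   # P(Y = i), made-up
p_f = np.array([0.2, 0.6, 0.2])   # P(f_hat(X) = i), made-up

sy, sf = p_y.argmax(), p_f.argmax()   # systematic values: the modal classes
var_y = 1 - p_y[sy]                   # Var(Y) = 1 - P(Y = SY)

sys_eff = (1 - p_y[sf]) - (1 - p_y[sy])     # E[L(Sf,Y)] - E[L(SY,Y)]
var_eff = (1 - p_y @ p_f) - (1 - p_y[sf])   # E[L(f,Y)] - E[L(Sf,Y)]

total = var_y + sys_eff + var_eff
print(f"decomposition = {total:.3f}, direct P(Y != f) = {1 - p_y @ p_f:.3f}")
\end{verbatim}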

In this setting, the classification prediction error is given by \[\label{classificationEPE} EPE = {{\mathbb{E}}}_X\left[ \sum_{k=1}^{K} L( Y_k , \hat{f}(X) ) P(Y_k \mid X) \right]\]

In practice we will rely heavily on the training error of the model, which serves as our approximation to the prediction error.

Training Error: the average loss over the training set, \[\overline{err} = \frac{1}{N} \sum_{i=1}^N L(y_i, \hat{f}(x_i) )\]
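In code, the training error is just the empirical mean of the loss over the fitted samples; a minimal sketch with the squared loss as default:

\begin{verbatim}
import numpy as np

def training_error(y, y_hat, loss=lambda a, b: (a - b) ** 2):
    # average loss over the N training samples
    return float(np.mean(loss(y, y_hat)))
\end{verbatim}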

Recall that \(X \in \mathbb{R}^{p}\) denotes a random input variable and \(Y \in \mathbb{R}\) a random output variable with joint distribution \(P(X,Y)\).

We will then evaluate models according to their performance on these two errors over training and testing datasets. Combinations of high or low values for each, across training and test data, are used for model evaluation and highlight the aspects to be improved.

For example, a very common scenario is a model with high overall variance and low bias. A learner is said to overfit a training set \(\mathcal{T}\) if it fits parameters \(\hat{\theta}\) while there exists \(\theta^*\) such that \[\label{eq:overfitting} Error_{train}(\hat{\theta}) < Error_{train}(\theta^*) \ \text{and} \ Error_{test}(\theta^*) < Error_{test}(\hat{\theta}) \]

Once the model has been fit on the training set, it overfits when predictions on training samples are very accurate whilst predictions on new samples spread out and raise the error. The model has learned the training data very well but fails to be accurate on new samples.
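The classic way to see this is polynomial regression: as the degree grows, the training error keeps falling while the test error eventually rises. A sketch under the same illustrative synthetic setup used earlier:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
xtr = rng.uniform(-3, 3, 20)                 # small training set
ytr = np.sin(xtr) + rng.normal(0, 0.3, 20)
xte = rng.uniform(-3, 3, 10_000)             # large independent test set
yte = np.sin(xte) + rng.normal(0, 0.3, 10_000)

for deg in (1, 3, 15):
    # theta_hat of degree deg (high degrees may warn about conditioning)
    coef = np.polyfit(xtr, ytr, deg)
    tr = np.mean((ytr - np.polyval(coef, xtr)) ** 2)
    te = np.mean((yte - np.polyval(coef, xte)) ** 2)
    print(f"degree {deg:2d}: train {tr:.3f}  test {te:.3f}")
\end{verbatim}

The degree-15 fit typically drives the training error near zero while the test error blows up, matching the definition in Equation \ref{eq:overfitting}.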

Given that the bias and the variance can each be high or low, there are only four possible performance regimes for any model. Our objective will then be to work on the data and the model to reach the state of low bias and low variance; different states of bias and variance call for different improvement strategies.

