
Cross Validation

There are a number of approximating functions or algorithms that can be used in classification problems, and each of these has different configurations or setups that, when fitted, produce different learners. The differences among configurations are controlled by the parameters of the algorithm. The nature of each parameter might stem from computational or statistical aspects of the algorithm used. One common example is the \(\lambda\) penalization parameter in logistic regression, where \(\lambda\) controls the weight given to the regularization term in the minimized objective function.
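As a sketch of this example, assuming binary targets \(y_i \in \{0,1\}\) and an \(\ell_2\) penalty (other penalties, such as \(\ell_1\), are also common), the regularized fit solves

\[ \hat{\beta}_\lambda = \arg\min_{\beta} \left\{ -\sum_{i=1}^{N} \left[ y_i \log p_\beta(x_i) + (1 - y_i) \log\big(1 - p_\beta(x_i)\big) \right] + \lambda \lVert \beta \rVert_2^2 \right\}, \qquad p_\beta(x) = \frac{1}{1 + e^{-x^{\top}\beta}}, \]

so that a larger \(\lambda\) shrinks the coefficients more strongly, trading variance for bias.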

The objective of Cross Validation (CV) is to systematically explore the different configurations of tuning parameters and decide which learner is best suited for the task at hand. Each candidate model, defined by a particular setting of the tuning parameters, is assigned an estimate of its generalization error, and this estimate is then used to select the best model in the class.

This technique is one of the most widespread methods for evaluating the generalization performance of a set of learners. Given a number of possible configurations or values for the tuning (hyper-) parameters of an algorithm, we would like to decide which of these yields the best estimator, as measured by the generalization error. In most cases data is scarce, and at the same time prediction error estimates based on asymptotic or analytical results are hardly computable in practice. CV therefore guards against over-fitting by iteratively holding out a random part of the dataset and measuring the predictive accuracy of the learner, fitted on the in-sample data, against the new values from the held-out part. The accuracy here is measured by the loss function, which must be selected beforehand.

In this procedure, each part of a partition of the samples is called a fold. CV partitions the data into \(K\) random folds, where the number \(K\) has to be decided beforehand.1 Let \(\gamma : \{1,..,N\} \mapsto \{1, .., K\}\) be a function mapping samples to folds. Without loss of generality, let \(\alpha\) be an index of the model’s hyper-parameters, where each distinct combination of the tuning parameters is identified by this index. Note that the domain of \(\alpha\) will vary with the type of approximating function or algorithm used to learn.
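As an illustration, a minimal sketch of such a fold assignment (the helper name make_fold_assignment, the use of NumPy, and the 0-based indexing are assumptions made here, not part of the text) could be:

    import numpy as np

    def make_fold_assignment(n_samples, n_folds, seed=0):
        """Randomly map sample indices {0,...,n_samples-1} to folds {0,...,n_folds-1}."""
        rng = np.random.default_rng(seed)
        # Assign roughly n_samples / n_folds samples to each fold, then shuffle
        # so that the partition into folds is random.
        return rng.permutation(np.arange(n_samples) % n_folds)

    gamma = make_fold_assignment(n_samples=100, n_folds=5)   # gamma[i] is the fold of sample i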

The CV algorithm then iterates over all of the folds, taking one fold \(\gamma^{-1}(\{k\})\) at a time as the validation set, where \(k \in [1,...,K]\) indexes the iteration. \(\hat{f}^{-k}\) denotes the estimator fitted on the training set with the \(k\)-th fold held out, and its classification performance is tested on the held-out samples. This means that for every sample in the \(k\)-th fold we measure the loss \(L(y_i, \hat{f}^{-\gamma(i)}(x_i))\) of the model’s prediction against the true target value.

Cross validation intends to estimate the expected out-of-sample error \({{\mathbb{E}}}\left[ L(Y, \hat{f}(X)) \right]\), i.e., the error when the model is tested against independent samples from the true distribution. For this reason, it is fundamental to ensure independence between the training set and the test set. Any transformation of the input data that jointly uses the input and output samples must be learned on the training set only. Otherwise we would introduce information from the test set into the estimator, whereas the model should never see the test data until we use it to evaluate the learner.
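As a minimal illustration with a simple standardization step (the helper standardize_train_test is hypothetical; the same caution applies to supervised steps such as target-based feature selection), the statistics are learned on the training fold only:

    import numpy as np

    def standardize_train_test(X_train, X_test):
        """Learn the scaling on the training fold only, then apply it to both sets."""
        mu = X_train.mean(axis=0)
        sigma = X_train.std(axis=0) + 1e-12            # guard against zero variance
        # The held-out data is transformed with statistics computed from the training
        # fold alone, so no information from the held-out samples enters the estimator.
        return (X_train - mu) / sigma, (X_test - mu) / sigma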

Tuning-parameters (hyper-parameters) Selection:

Let \(\mathcal{A} = [\alpha_0, \alpha_1, ..., \alpha_l]\) be a list of hyper-parameter settings and \(\mathcal{K} = [1,..,K]\) a list of folds. A full \(K\)-Fold Cross Validation procedure takes the following form.

Initialize \(\mathcal{A}\) and \(\gamma(\cdot)\). Then, for each configuration \(\alpha \in \mathcal{A}\) and each fold \(k \in \mathcal{K}\), fit the estimator \(\hat{f}^{-k}_{\alpha}\) on the samples outside fold \(k\) and record the losses \(L(y_i, \hat{f}^{-k}_{\alpha}(x_i))\) for every held-out sample \(i\) with \(\gamma(i) = k\). Averaging these losses over all samples gives, for each configuration,

\[ CV(\alpha) = \frac{1}{N} \sum_{i=1}^{N} L\left(y_i, \hat{f}^{-\gamma(i)}_{\alpha}(x_i)\right). \]

Note that during the loop, each sample’s prediction is produced by a model that was fitted without using that sample.
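A minimal sketch of the whole procedure in Python (the names kfold_cv, fit and loss are placeholders for a concrete learner and loss function; they are assumptions made here rather than part of the text) might look as follows:

    import numpy as np

    def kfold_cv(X, y, alphas, gamma, fit, loss):
        """Estimate CV(alpha) for each configuration alpha by K-fold cross validation.

        X, y          : NumPy arrays of inputs and targets.
        alphas        : iterable of hyper-parameter configurations (the list A).
        gamma         : integer array, gamma[i] is the fold of sample i (0-based).
        fit(X, y, a)  : returns a fitted model with a .predict(X) method.
        loss(y, yhat) : returns the per-sample losses L(y_i, yhat_i).
        """
        n, n_folds = len(y), int(gamma.max()) + 1
        losses = {alpha: np.empty(n) for alpha in alphas}
        for k in range(n_folds):                       # iterate over folds
            hold_out = gamma == k                      # validation fold gamma^{-1}({k})
            for alpha in alphas:
                # f^{-k}_alpha: fitted on all samples outside fold k
                model = fit(X[~hold_out], y[~hold_out], alpha)
                losses[alpha][hold_out] = loss(y[hold_out], model.predict(X[hold_out]))
        # CV(alpha) = (1/N) * sum_i L(y_i, f^{-gamma(i)}_alpha(x_i))
        return {alpha: fold_losses.mean() for alpha, fold_losses in losses.items()}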

From the CV procedure it makes sense to choose as final model the \(\hat{f}_\alpha\) with the lowest \(CV(\alpha)\) value among all possible hyper-parameter settings. However, following the ideas detailed in \ref{section-VcDimension}, importance is also given to the complexity of the approximating function. A common rule of thumb is to favor models with a smaller number of hyper-parameters or features. In practice, though, a class of approximating functions might have a complexity that is not analytically computable. Thus in some cases crude heuristic estimates or common sense are used to gauge model complexity, without making use of theoretical arguments.
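In terms of the hypothetical kfold_cv sketch above, the basic selection rule then reduces to a single line:

    best_alpha = min(cv_scores, key=cv_scores.get)   # configuration with the lowest CV(alpha)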

Choice of the \(K\) parameter

Experimental results show the differences among CV routines used with varying values of \(K\). On synthetic and real datasets, \cite{hastie-elemstatslearn}, p. 243, found that using a higher value of \(K\), relative to the size of the dataset, means having a larger training set, since each left-out partition contains only \(\frac{N}{K}\) samples. The results show models with low bias but high variance, which is in accordance with the small size of the validation set. Note that a higher value of \(K\) also entails a higher computational burden, since \(K\) estimators need to be fitted.

Having a lower value of \(K\) means using a smaller training set, and it is then common to find models with lower variance and higher bias. Empirical results also tend to show that the expected prediction error (\(EPE\)) is overestimated: with less data available to fit the model, the estimators perform worse than their asymptotic behaviour would suggest, and the error estimate is inflated accordingly.

One last common choice is \(K = N\), so that each fold contains a single sample. This scenario is known as leave-one-out CV and is a widely used setup in problems where data arrives sequentially over time. Here, samples of the training set \(t_i = ( \boldsymbol{x_i} , \boldsymbol{y_i} )\) are accessible only in an ordered fashion. In these cases model evaluations are made against the new sample and the training set is updated with every arrival.2
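Under the assumptions of the earlier (hypothetical) kfold_cv sketch, leave-one-out CV corresponds to a fold assignment with exactly one sample per fold:

    gamma_loo = np.arange(len(y))                    # fold i contains only sample i, so K = N
    loo_scores = kfold_cv(X, y, alphas, gamma_loo, fit, loss)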

A good survey and explanation of the drawbacks of the \(K\)-Fold CV estimator can be found in \cite{bengio-unbiasedCvEstimator}. The authors prove that there is no distribution-free unbiased estimator of the variance of the \(K\)-Fold cross validation estimator. The theoretical arguments focus on the idea that error correlations between the training and validation sets are not taken into account by the CV procedure. These correlations are known and understood in the general case, but their estimation is not possible. As a consequence, comparison between different candidate models is hindered.

Empirical examples built from synthetic datasets show the shortcomings of the cross validation algorithm, whose estimate might deviate considerably from its central value. Some exceptions appear, though, specifically where distribution-free bounds can be found for a class of approximating functions; but such cases are specific to that class. In general these bounds are not known, so the unknown deviation of the CV estimator will affect the evaluation of different models.

The consensus3 is that, in general, CV is a good procedure for estimating the expected prediction error, averaged over training sets, but not for estimating the prediction error conditional on the particular training set \(\mathcal{T}\) at hand.


  1. We assume here that a random sample of the initial dataset was already left out in order to test the model’s accuracy at the end. We refer to this set as the testing set.

  2. Note that here we must assume the exchangeability of samples. This means that the distribution of the training set is not altered by a permutation of samples, \(F(x_1,...,x_n ) = F(x_{\sigma(1)},...,x_{\sigma(n)})\) for any permutation \(\sigma\) of the samples.

  3. \cite{hastie-elemstatslearn}, p. 260.