
As we know, the learner approximates the target with \(y \approx \hat{y} = h\left(\sum_{j}\theta_j x^j\right)\). The first idea is to find the hyperplane that best separates the two classes by estimating the parameter \(\hat{\theta}\). Since we now have a problem with \(p\) degrees of freedom, what is left is to find the parameter by optimizing a certain criterion. This choice will depend on how we decide that one parameter is better than another. For example, the choice of \(0.5\) as a threshold in \ref{formula:logitThreshold} is ad hoc and certainly one we could also try to fit in the optimization process. In the context of machine learning, functions that define a quantitative measure of a parameter’s performance are called loss functions. A naive approach to finding the parameters in our problem would be to minimize the residual sum of squares. In the context of linear regression with Gaussian residuals, minimizing the residual sum of squares is equivalent to maximizing the likelihood of the target given the input data. This amounts to minimizing the following sum (RSS): \[\label{eq:rss} RSS(\theta_0,\dots,\theta_p) = \sum_{i=1}^n [y_i - \hat{y}_i]^2 = \sum_{i=1}^n [y_i - h( \theta \cdot x_i)]^2\]
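As a concrete illustration, the following is a minimal Python sketch of how the quantity in \ref{eq:rss} could be computed for a candidate parameter. The sigmoid choice for \(h\) and the toy data are assumptions made only for the example, not something prescribed by the text.

\begin{verbatim}
import numpy as np

def sigmoid(z):
    # A plausible choice for h in a binary-classification setting (assumed here).
    return 1.0 / (1.0 + np.exp(-z))

def rss(theta, X, y, h=sigmoid):
    # RSS(theta) = sum_i (y_i - h(theta . x_i))^2
    y_hat = h(X @ theta)      # predictions h(theta . x_i) for every row x_i
    residuals = y - y_hat
    return np.sum(residuals ** 2)

# Toy usage: 5 samples, 2 features (hypothetical data, no intercept for brevity).
X = np.array([[0.5, 1.2], [1.0, -0.3], [-0.7, 0.8], [2.0, 0.1], [-1.5, -1.0]])
y = np.array([1, 1, 0, 1, 0])
theta = np.zeros(2)
print(rss(theta, X, y))       # RSS evaluated at theta = 0
\end{verbatim}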

The equation reflects our goal of correctly matching each training sample with its target. But in truth we are interested in a model that generalizes, one that makes good predictions on any sample, even new ones. The previous equation is a poor proxy for the generalization of the classification model constructed from the data: it is built to fit the actual dataset, and we would not expect it to perform well on any other sample drawn from the true distribution of \(\mathrm{T}\).

Prediction Error. Given a choice of parameter \(\theta\), the prediction error for the resulting classifier \(f_\theta\) is

\[Err_{true}(\theta) = {{\mathbb{E}}}_{\mathrm{T}}[(\textbf{y} - h(\textbf{x} \cdot \theta) )^2] = \int (y - h(x \cdot \theta) )^2 \, P(x,y)\,dx\,dy\]

Note that the integration here is over the joint distribution of inputs and outputs. In practice we only have incomplete information about \(P(x,y)\), given the finite data sample. We must therefore assume that computing this integral exactly is not feasible for any \(\theta\), and rely on estimation procedures instead.

For example, we could try to sample \(M\) i.i.d. points from \(P(x,y)\) and approximate the integral by a Monte Carlo scheme such as

\[\label{eq:mcarlo-approx} Err_{true}(\theta) \approx \frac{1}{M} \sum_{i=1}^M ( y_i - h(x_i \cdot \theta) )^2\]
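For illustration, here is a small Python sketch of such a Monte Carlo estimate, assuming we could draw fresh pairs \((x, y)\) from \(P(x,y)\). The generative process \texttt{sample\_xy} is a hypothetical stand-in invented for the example; in practice this distribution is exactly what we do not know.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_xy(M, theta_true):
    # Hypothetical stand-in for P(x, y); unknown in any real problem.
    x = rng.normal(size=(M, theta_true.size))
    y = rng.binomial(1, sigmoid(x @ theta_true))   # binary targets
    return x, y

def err_true_mc(theta, M, theta_true):
    # Monte Carlo estimate (1/M) * sum_i (y_i - h(x_i . theta))^2 of Err_true(theta).
    x, y = sample_xy(M, theta_true)
    return np.mean((y - sigmoid(x @ theta)) ** 2)

theta_true = np.array([1.0, -2.0])
print(err_true_mc(np.zeros(2), M=10_000, theta_true=theta_true))
\end{verbatim}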

However, this would be unfeasible as well, since the sampling process would have to be repeated for each specific \(\theta\). Notice, however, the close resemblance of equation \ref{eq:mcarlo-approx} to the form in \ref{eq:rss}.

In classification problems, the residual sum of squares is not the loss function used to estimate the model's parameters; instead, one relies on other loss functions that we will introduce later.

For now, let \(f: X \rightarrow Y\) be a function mapping the feature space to the target space, and assume that \(y = f(x) + \epsilon\), with \(\epsilon \sim {\mathcal{N}}(0,1)\), is a good description of our data. Then the equation in \ref{eq:rss} can be read as the error of the parameter \(\theta\) on the training set.
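Under this Gaussian-noise assumption, the equivalence with maximum likelihood mentioned above can be made explicit. A short derivation, with \(f(x)\) modelled as \(h(\theta \cdot x)\) and unit noise variance as assumed here:

\[\begin{aligned}
\log L(\theta) &= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\tfrac{1}{2}\,[y_i - h(\theta \cdot x_i)]^2\right) \\
&= -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n} [y_i - h(\theta \cdot x_i)]^2,
\end{aligned}\]

so maximizing \(\log L(\theta)\) over \(\theta\) is the same as minimizing \(RSS(\theta)\) in \ref{eq:rss}.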

Our interest is now in minimizing the training error \[Err_{train}(\theta) = \frac{1}{n} \sum_{i=1}^{n} ( y_i - h(x_i \cdot \theta) )^2 \approx {{\mathbb{E}}}_{\mathrm{T}}[(y - h(x \cdot \theta) )^2]\]
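A minimal sketch of minimizing this training error by plain gradient descent follows; the sigmoid for \(h\), the step size, the iteration count, and the toy data are all assumptions made for the example rather than choices taken from the text.

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def err_train(theta, X, y):
    # Mean squared training error (1/n) * sum_i (y_i - h(x_i . theta))^2
    return np.mean((y - sigmoid(X @ theta)) ** 2)

def grad_err_train(theta, X, y):
    # Gradient of err_train w.r.t. theta (chain rule through the sigmoid).
    p = sigmoid(X @ theta)
    return (-2.0 / len(y)) * X.T @ ((y - p) * p * (1.0 - p))

def fit(X, y, lr=0.5, n_steps=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        theta -= lr * grad_err_train(theta, X, y)
    return theta

# Toy data: 4 samples, 2 features (hypothetical).
X = np.array([[1.0, 0.2], [0.9, -1.0], [-1.1, 0.4], [-0.8, -0.6]])
y = np.array([1, 1, 0, 0])
theta_hat = fit(X, y)
print(theta_hat, err_train(theta_hat, X, y))
\end{verbatim}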