[section] [theorem] [theorem]Lemma [subsection]

Logistic Regression for Classification

\label{section-logisticRegression}

This model is named as a regression model but is actually used for classification and this convention has been used in this way for historical reasons.

Let \(\{C_1,..,C_k\}\) denote the set of possible target classes. We are interested in maximizing the probability of belonging to a certain class, given the input data: \[P(C_k| X) = \frac{P(x|C_k)P(C_k)}{P(x)}\]

Logistic Regression Given a choice of parameter \(\theta\),

Logistic regression models the posterior probabilities of the classes with a transformation on a linear function of inputs. It also conditions that the posterior probabilities must sum to one. Here the probabilities describing the possible values for the target variable is modeled using a logistic function.

To satisfy this the log-odss ratios of all the classes with respect to a fixed class must all be linear in the inputs. This means that

\[log\big( \frac{P(C_i|x)}{P(C_j|x)}\big) = \theta_{i0} + \theta_i^\intercal x\] for \(i,j \in \{1,2,...,K\}, i\neq j\) \label{logit-logOddss}

With these same indices we have that

\[P(C_i|x) = \frac{ exp(\theta_{i0} + \theta_i^\intercal x)}{1 + exp(\theta_{j0} + \theta_j^\intercal x)}\]

Then the model is specified in terms of the log-odds for each class and is parameterized by \(\theta\).

This model sets the target as a Bernoulli random variable when conditioned on the input variables. Formally,

\[\begin{split} Y_i \mid X_i \ \sim & \operatorname{Bernoulli}(p_i) \\ \mathbb{E}[Y_i \mid X_i ] = & p_i \end{split}\]

The probability function of the target given the features \(\Pr(Y_i=y\mid X_i)\) is given by \[\Pr(Y_i=y \mid X_i = x_i) = p_i^{y} (1-p_i)^{(1-y)}\]\label{logit-probabilityDensity} where this depends on the class dependence of \(y\).

Here the logit function is used to map log odds into conditional probabilties and the model outputs the predicted probabilities of the target variables belonging to the target class.

In this way, we are specifying a model where the target is a linear function of the inputs, corrected by an error term: \[Y_i = I(\theta_0 + \theta \cdot X_i + \epsilon) \ \forall i\]. \label{logit-indicatorFunction}

where \(\epsilon\) is the error of the approximation and is distributed with the standard logistic distribution. If we take the conditional probability on \ref{logit-indicatorFunction}, given the features we will have that \[logit(p_i)= ln(\frac{p_i}{1-p_i}) = logit\big( {{\mathbb{E}}}[Y_i| X_i] \big) = \theta_0 + \theta \cdot X_i\]

From this equation we have once again that

\[p_i = \sigma(\theta \cdot X_i) = \frac{exp(\theta \cdot X_i) }{1 + exp(\theta \cdot X_i)}\]

where we have used an abuse of notation to absorb \(\theta_0\) into \(\theta\). Finally, if use this and plug it into the initial conditional probability of \(Y_i\) in \ref{logit-probabilityDensity} we will have

\[\Pr(Y_i=y \mid X_i = x_i) = p_i^{y} (1-p_i)^{(1-y)} = \frac{exp(y . \theta \cdot X_i) }{1 + exp(\theta \cdot X_i)}\]

As usual, ,aximum likelihood is a common method used to fit models. In this case, given a parameter \(\theta\), the probability of having a target vector \(y\) yields \[P(Y =y \mid \theta ) = \prod_{i=1}^N P(y_1 \in C_1 \mid x_i, \theta)^y_i(1 - P(y_1 \in C_1 \mid x_i, \theta) )^{(1-y_i)}\]

If we take into account that \(P(y=1 \mid x,\theta) = 1 - P(y=0 \mid x,\theta)\) , then the estimation of \(\theta\) for \(N\) samples gives the following loss function

\[l(\theta) = \sum_{i=1}^N \big(y_i log(P(y_i \mid x_i,\theta)) + (1-y_i)log(1 - P(y_i \mid x_i,\theta) ) \big)\]

Note the advantage that the loss function is concave in the parameters. Another advantage of this model is that closed forms can be given for the gradient and the Hessian of the loss function. The negative gradient can be given in the following way: \[- \nabla l(\theta) = \sum_{i=1}^N (y_i - P(y_i \mid x_i,\theta))\cdot x_i = \textbf{X}^{\intercal}(\textbf{y}-\textbf{p})\]

whilst the second order derivatives take the following form:

\[\frac{\partial^2 l(\theta)}{\partial \theta \partial \theta^\intercal} = \sum_{i=1}^N x_i \cdot x_i^\intercal P(y_i \mid x_i,\theta)(1 -P(y_i \mid x_i,\theta))\]

Model Regularization and Hyper-parameters

\label{subsection-hyperParametersRegularization}

Model Regularization Regularization of models is the process in which restrictions and conditions are imposed to the model’s function \(f\) through a functional \( R(f)\). The regularization term is used in the optimization procedure through the loss function.

Regularization is commonly used in problems which are ill-posed either because the model is biased or overfit. In general, restrictions are put to reduce the number of predictors used in the model. This shrinkage of parameters can be forced by penalizing large values for \(\theta\) either by the number of non-null components of this parameter or by measuring its distance to zero. By doing so, we hope that only the most relevant features are selected in the model. As we will see in \ref{section-VcDimension}, a shorter model has the benefit of being more accurate in the estimation for the prediction error. In other cases, the advantage of a reduced model is that the variance of the model is reduced, only with a slight increase in bias. A slightly more heuristic argument for model parsimony is that as such,all models will approximate natural up to a certain level if two models have the same predictive power, the simpler model should be used.

There are a number of different functionals \(R(\cdot)\) used to impose conditions on the model. The most common ones include introducing a penalty to the size of \(\theta\) by measuring a given norm \(\| \theta \|\).

Each has its own advantages and the details exceed the nature of this work. As a summary, benefits are related to robustness of the solution, convexity of the minimization function, unique minimum and model parsimony. The former case is known as Lasso regularization and the latter is known as ridge or Tikhonov regularization. Other variants include the elastic-net penalty which is a weighted average of the previous two norms, or the \(l0\) norm, which is the number of non null components of \(\theta\).

and taking a weighted average of two different penalizations. T

Even though some regularization methods have closed-form solutions, in practice most of them are found by iterative optimization routines such as gradient descent.

The following example shows the logistic regression’s model loss function with an \(l2\) regularization term on \(\theta\):

\[\label{logitRegularization} \begin{split} \min_{\theta} & \sum_{i=1}^N \(y_i ( \theta \cdot x_i ) + log(1 + e^{\theta \cdot x_i} ) \) + \lambda\| \theta\|_{2}^2 \\ \min_{\theta} & \sum_{i=1}^N \(y_i ( \theta \cdot x_i ) + log(1 + e^{\theta \cdot x_i} ) \) + \lambda \sum_{j=1}^{P} \theta_j^2 \end{split}\]

The value \(\lambda\) acts here as the weight that our optimization procedure will put to the regularization part of the minimization function and \(P\) is the number of features used in the model. Note that in an abuse of notation, the functional takes the form \(R(f) = \lambda\| \theta\|_{2}^2\) where the model is identified by its \(\theta\) parameter.

In this work we will be only considering model regularization with lasso, ridge, or elastic-net.

Note that in the model built from equation \ref{logitRegularization} there are two specific parameters which must be predefined before starting the optimization procedure. Notably \(\lambda\) and the \(l2\) norm used to measure the size of \(\theta\). These values are said to be hyper-parameters of the models since they are not directly part of the theoretical model. Also, some authors in the literature can refer to the hyper-parameters of the model as tuning parameters.

The hyper-parameters are the values set to configure the different possible variants in the loss function and in the model’s training phase. They are values that directly affect the estimated fit \(f_{\hat{\theta}}\) but are not the \(\theta\) parameter themselves. They need to be instantiated before training the learner and as such, they can’t be learned from the dataset.

Note that in a Bayesian context the definition of hyper-parameter is different. These appear in the prior distributions of the model’s parameters and should not be confused. In this case, the parameter’s of \(\theta\)’s distribution are called the hyper-parameters.