Authorea

[section] [theorem] [theorem]Lemma [subsection]

A working example

For the purpose of this work we will be talking of the training set, noted by \(\mathrm{T}\) as the set of data examples. The objective is to build a probabilistic model which has the capacity to predict correctly the class instance of new data objects based on having seen information of other data objects.

For concreteness, let’s consider a reduced dataset built from Call Detail Records (CDRs) where samples are calls being made by users who can belong to any of following provinces : Buenos Aires, Cordoba and Santa Fe. Five measurements were made on all of the observations to account for the user’s number of calls and total duration of calls during a week of measurements. A short example of this dataset can be seen below:

\label{tab:sample_CDR}

Head of the CDR dataset
User	CallsWeekend	TimeWeekend	CallsWeekDays	TimeWeekday	Province
BA343E	15	89	8	24	Santa Fe
73F169	10	121	2	98	Cordoba
EA23AD	12	43	5	154	Buenos Aires

In this form a row is representing the available data acquired from each acquired and columns represent types of measurements or information. In general, most machine learning problems will be associated with a training set \(\mathrm{T}\) of similar form as the one shown before. Where rows represent objects or samples and columns are measurements or features of our samples. Here \(\mathrm{T}\) will take the form of a paired couple of datasets \((X,Y)\) where \(X \in \mathbb{R}^{n \ x \ p}\) and \(Y\) is, in general, a real valued vector of length \(n\). The \(jth\) column of \(X\) or equivalently the \(jth\) feature of the data is denoted by \(X^j\). Similarly, the \(ith\) row or sample of \(X\) is denoted by \(X_i\) or \(x\) when the superscript is understood from context. A similar notation is used with the outputs, where \(Y_i\) of \(y\) will be used to denote the target of a specific sample depending on context.

In this particular case, even though the last column of the dataset in \ref{tab:sample_CDR} is not a measurement per se, it gives information on each user’s province of residence. From here, there are various questions one could try to answer. Examples of problems that a machine learning algorithm could tackle from this data could be:

Predict a user’s province when given information on only the first four measurements.
Predict a user’s number of calls made on weekends when given information on the last four measurements.
Give an estimate of the probability density function for a user’s calling time during weekdays.

The first two problems are examples of supervised learning where the \(Y\) variables or labels are the last and first features (columns) respectively. Even more, the first problem is a classification one since users are to be classified according to one of the three possible provinces whilst the second one is a regression problem for which the output could be any of a range of numerical values. The labels in classification problems are numerically encoded with a finite range of numbers where \(\{0,1\}\) or \(\{-1,1\}\) are usually used for the binary case.

Note that there are no assumptions made here about the data which we take as is given. This is common in machine learning applications and because of this, algorithms tend to be designed to account for this situation. The type of problems and questions that could be done then depend entirely on the training set.

The last example problem in the list before, belongs to the unsupervised learning category where there is no need to have a label on the data. Here the question is on the structure of a specific column, namely the estimation of the true probability distribution of the calling time. In this setting there is no output expected for new samples.

To illustrate further the supervised classification scenario, a brief notation outline is described below following a standard logistic regression classifier example:

Let’s suppose that the problem is to determine the province of origin of a user. A classification algorithm should assign samples to provinces. Let \(\{C_1,..,C_k\}\) denote the set of possible target classes. We are interested in maximizing the probability of belonging to a certain class, given the phone data: \[P(C_k| X) = \frac{P(x|C_k)P(C_k)}{P(x)}\]

In this interpretation, \(P(C_k)\) is known as the prior probability of belonging to that certain class and \(P(C_k|x)\) as the posterior probability, given the data.

Our classifiers will partition input space into decision regions \(R_k\) for which a class \(C_k\) is uniquely associated to a region. It makes sense to try and minmize the chances of assigning samples to incorrect regions. For example, take a problem with \(K\) classes. Given a sample \(x_i\), the probability of a correct classification is measured by

\[\begin{split} = & \sum_{k=1}^{K} P(x_i \in R_k, C_k ) \\ = & \sum_{k=1}^{K} \int_{R_k}P(x_i,C_k) dx \\ = & \sum_{k=1}^{K} \int_{R_k}P(C_k \mid x_i) P(x_i) dx + \end{split}\]

Observe that the measure is completely characterized by the posterior probabilities. We observe that the factor \(P(x)\) is common to both integrals so we only need to maximize the posteriors. So even when \(P(x)\) might not be known or accurately estimated, the algorithm can only modify the decision regions to improve the probability of correctly classifying samples. The goal of the algorithm will then be to chose the best possible regions \(R_k\) for the problem.

A more reduced problem instance would be to determine if a user belongs to the province of Cordoba vs the rest of the provinces. This brings us to a binary learning problem. As a simple solution, we could build a learner \(f\) from a linear combination of the input features that will predict the target output. Ideally we would have that for every sample \(y = f(x) = h\left(\sum_{j}\theta^jx^j\right)\) \label{formula:1} ¹ where \(\theta\) is unknown and generally referred to as the coefficient or parameter.

If we let \( t_i = \theta \cdot x_i \ \forall \ i \),in the ideal case we will have that \(\exists \theta s.t. \forall i \) \[ t_i \begin{cases} &>0 \ \mbox{if} \ y_i=1 \\ &<0 \ \mbox{if} \ y_i=0. \end{cases} \] Then if we were to use the heavyside step function \(h(z)\) where

\[ h(z) = \begin{cases} &0 \ \mbox{if} \ z<0 \\ &\frac{1}{2} \ \mbox{if} \ z=0 \\ &1 \ \mbox{if} \ z>0. \end{cases} \]

If the goal of the model is to maximize \(P(Y_i = y_i | X_i = x) \ \forall i\) under the assumption that \(\nexists\ i \ s.t. \ t_i = 0\), the algorithm would then correctly classify all samples to their targets. However this situation is hardly the case, so we use a more common approach. A more tractable function for this task is the logistic function which is an approximation of the heavyside step function. Here

\[\label{eq:logisticFunction} \sigma(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} \\ and \ H(z) = \lim_{k \to \infty} \left(\frac{1}{2} + \frac{1}{2}tanh(kz) \right) = \lim_{k \to \infty} \left(\frac{1}{1+e^{-kz}} \right)\]

The logistic function has the advantages of being smooth, well defined for all real numbers and its image is \((0,1)\). This property lends itself to reading outputs as probabilities of targets belonging to a certain class. The derivative of this function can be put in terms of the function itself, where

\[\label{eq:derivativeLogisticFunction} \sigma '(a) = \sigma(a)( 1 - \sigma(a) )\]

The logistic function is also bijective, with the inverse given by the logit function

\[\label{eq:logitFunction} \sigma^{-1}(z) = log( \frac{z}{1 - z})\]

For the scenario characterized in \ref{formula:1} we would have to finally assign each output to a specific class. A common approach for this is to categorize each output whether \(\hat{y} > 0.5\) \label{formula:logitThreshold}. Notice that having \(h(x \cdot \theta) = 0.5\) implies that \(x \cdot \theta = 0\) and thus our classifier is separating samples in feature space (the space of the inputs) with the hyper-plane defined by the parameter \(\theta\). Having a higher \(\hat{y}\) for a given sample implies that it is further away from the hyper-plane. The same goes for low estimated targets. If we were to read this as a probability, we can interpret that the algorithm is modeling the posterior probability \(P(Y_i = y | X_i = x)\). Here the model is said to be linear because it is linear in \(X\), implying that the decision boundary is linear too.

then it corresponds to having a higher confidence in the classification.

In formula \ref{formula:1} we have omitted the intercept parameter , generally noted as \(\theta_0\). The reason for this is that one can include this parameter implicitly if we allow for a synthetic feature \(X^0\) in \(X\) which is a vector with all of its components equal to 1.↩