\documentclass[]{article}
\usepackage[utf8]{inputenc}
\usepackage[a4paper, total={6in, 8in}]{geometry}
\usepackage{graphicx}
\graphicspath{{images/}}
\usepackage{listings}
\usepackage{mdframed}
\lstset{
xleftmargin=.2\textwidth,
xrightmargin=.2\textwidth,
captionpos=b,
mathescape
}
\usepackage{amsmath} % Used to display matrices.
\usepackage{amssymb} % Used to display real numbers.
\title{Machine Learning Week}
\author{Duncan Attard \thanks{adapted from the online notes by Dr. Andrew Ng.}}
\date{November 2014}
\begin{document}
\maketitle
\section{Introduction}
\subsection{What is Machine Learning?}
Two definitions of Machine Learning are offered. Arthur Samuel described it as ``The field of study that gives computers the ability to learn without being explicitly programmed''. This is an older, informal definition.
A more recent definition by Tom Mitchell defines Machine learning as follows: ``A computer program is said to learn from experience \textit{E} with respect to some class of tasks \textit{T} and performance measure \textit{P}, if its performance at tasks in \textit{T}, as measured by \textit{P}, improves with experience \textit{E}''.
For example, if we are playing checkers:
\begin{itemize}
\item E: the experience of playing many games of checkers.
\item T: the task of playing checkers.
\item P: the probability that the program will win the next game.
\end{itemize}
\subsection{Supervised Learning}
In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output. Supervised learning problems are categorized into \textbf{regression} and \textbf{classification} problems. In a \textbf{regression} problem, we are trying to predict results with a \emph{continuous} output, meaning that input variables are being mapped to some \textit{continuous} function. In a \textbf{classification} problem, we are instead trying to predict results in a \textit{discrete} output. That is to say, we are mapping input variables into \textit{discrete} categories.
Example:
\begin{description}
\item[Regression] ``Given data about the size of houses on a real estate market, try to predict their price''. Price as a function of size is a \textit{continuous} output, so this is a regression problem.
\item[Classification] To turn the same example into a classification problem, we could reword it as follows: ``Given data about the size of houses on a real estate market, does this house sell for more or less than the asking price?''. Here we are classifying houses based on price in two \textit{discrete} categories.
\end{description}
\subsection{Unsupervised Learning}
Unsupervised learning, on the other hand, allows us to approach problems with little or no idea of what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
We can derive this structure by for example \textit{clustering} the data based on relationships among the variables in the data. With unsupervised learning, there is no feedback based on the prediction results. In other words, there is no teacher to correct the algorithm, hence the term unsupervised. Another example of unsupervised learning is \textit{associative memory}.
For example:
\begin{description}
\item[Clustering] ``Given a collection of 1000 essays written about the US economy, find a way to automatically group these essays into a small number of groups that are somehow similar''. These could be related by different variables, such as word frequency, sentence length, page count, and so on.
\item[Associative] ``A doctor with years of experience forms associations in his mind between patient characteristics and the illnesses they have. When a new patient is diagnosed, then, based on this patient's characteristics such as symptoms, family medical history, physical attributes, mental outlook, etc., the doctor is able to associate possible illnesses based on what he has seen before in similar patients''. This is not the same as rule-based reasoning in expert systems. In this case, we would like to estimate a mapping function from patient characteristics to illnesses.
\end{description}
\section{Linear Regression with One Variable}
\subsection{Model Representation}
Recall that in \textit{regression} problems, we are taking the input variables and mapping the output onto a \textit{continuous} expected result function. Regression problems arise in two flavors:
\begin{description}
\item[Univariate Linear Regression] or linear regression with \textit{one} variable. This is used when we want to predict a single output from a \textit{single} input value.
\item[Multivariate Linear Regression] or linear regression with \textit{multiple} variables. This is used when we want to predict a single output from \textit{multiple} input values.
\end{description}
Remember that since we are doing \textit{supervised learning}, we already have an idea what the input and output cause and effect should be.
\subsection{The Hypothesis Function}
The \textit{hypothesis} function is used to predict output values, and has the general form:
$$h_{\theta}(x) = \theta_0 + \theta_1x$$
We give the function $h_{\theta}(x)$ values for $\theta_0$ and $\theta_1$ as parameters, in order to get the output $y$. In other words, we are trying to come up with a function $h_{\theta}(x)$ that is able to map the input data (the $x$'s) to the output data (the $y$'s).
Example:
\begin{center}
\begin{tabular}{c|c}
\hline
$x$ (input) & $y$ (output)\\
\hline
0 & 4\\
1 & 7\\
2 & 7\\
3 & 8\\
\hline
\end{tabular}
\end{center}
Now, supposing we make a random guess about $h_{\theta}(x)$, and plug in the values $\theta_0 = 2$ and $\theta_1 = 2$, then the hypothesis function becomes $h_{\theta}(x) = 2 + 2x$. Using this hypothesis with an input value of $x = 1$, $h_{\theta}(x)$ returns 4. According to the table above, the actual output is 7, so the prediction is off by 3.
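As an aside, this guessed hypothesis is easy to check with a few lines of code. The following Python sketch (our own illustration, not part of the original notes; the function name \texttt{h} is hypothetical) evaluates $h_{\theta}(x) = 2 + 2x$ against the table above:

```python
# Training set from the table above.
xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]

def h(x, theta0=2, theta1=2):
    """Hypothesis h_theta(x) = theta0 + theta1 * x, with the guessed parameters."""
    return theta0 + theta1 * x

for x, y in zip(xs, ys):
    # input, prediction, and how far off the prediction is
    print(x, h(x), y - h(x))
```

For $x = 1$ this prints a prediction of 4 against an actual output of 7, i.e. an error of 3.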
\subsection{The Cost Function}
We can measure the accuracy of our hypothesis function using what is known as a \textbf{cost function}. The cost function takes an average (actually a fancier version of an average) of all the results of the hypothesis with inputs $x$ compared to the actual outputs $y$. Mathematically:
\begin{equation}
J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})^2
\label{eq:linear-regression-j}
\end{equation}
Simply put, the cost function is $\frac{1}{2}\bar{x}$, where $\bar{x}$ is the mean (or average) of the squares of $h_\theta(x^{(i)}) - y^{(i)}$, i.e. the difference between the predicted value and the actual value. This function is also called the \textit{squared error function} or \textit{mean squared error}. In the cost function, the mean is halved as a convenience for the computation of gradient descent (see below), as the derivative of the square term will cancel out the $\frac{1}{2}$.
Now, armed with the cost function $J(\theta_0,\theta_1)$, we are able to measure the accuracy of the predictor function against the correct results we have. This enables us to predict results we don't have.
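Equation~\ref{eq:linear-regression-j} can be computed directly. The following Python sketch (again our own illustration; the name \texttt{cost} is hypothetical) evaluates $J(\theta_0,\theta_1)$ for the guess $\theta_0 = 2$, $\theta_1 = 2$ on the table data:

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum of squared prediction errors."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]
print(cost(2, 2, xs, ys))  # squared errors 4 + 9 + 1 + 0 = 14, so J = 14/8 = 1.75
```

A better choice of $\theta_0$ and $\theta_1$ would drive this number down, which is exactly what gradient descent does.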
\subsection{Gradient Descent}
Now that we have the hypothesis function and a way to measure how accurate it is, we need an automatic way of improving the hypothesis function. One such method which allows us to improve the hypothesis accuracy in a stepwise fashion is the \textbf{gradient descent} algorithm.
\begin{figure}[t]
\begin{mdframed}
\centering
\includegraphics[scale=0.6]{gradient-descent}
\caption{Plotting the cost function $J(\theta_0,\theta_1)$ for $\theta_0$ and $\theta_1$.}
\label{fig:gradient-descent}
\end{mdframed}
\end{figure}
Before tackling what gradient descent is doing, we will first visualise our problem to understand it better. Plotting the cost function $J(\theta_0,\theta_1)$ for the parameters $\theta_0$ and $\theta_1$ yields the surface seen in Figure~\ref{fig:gradient-descent}. This plot is the result of the combination of the parameters $\theta_0$ and $\theta_1$ used in the cost function. More intuitively, we are plotting the guesses of the hypothesis function, placing $\theta_0$ on the $x$ axis, $\theta_1$ on the $y$ axis, and the cost associated with these parameters on the vertical $z$ axis. Any point on the plotted surface identifies a result computed by the cost function for particular values of $\theta_0$ and $\theta_1$. That is to say, this cost (computed using Equation~\ref{eq:linear-regression-j} on page~\pageref{eq:linear-regression-j}) is the one resulting from using our hypothesis $h_{\theta}(x)$ with those specific parameter values.
The gradient descent algorithm tries to \textit{minimize} the cost function, and in doing so, chooses optimal values for $\theta_0$ and $\theta_1$ that can then be used by our hypothesis function. Referring to Figure~\ref{fig:gradient-descent}, and keeping in mind that the surface visualises the cost or error $h_{\theta}(x)$ makes when predicting some $y$ value, it is natural to conclude that the smaller the error, the better. The smallest error in the plot corresponds to the bottom of the surface. In other words, finding the lowest point on the plot will yield the optimal values for the parameters $\theta_0$ and $\theta_1$.
Gradient descent takes the \textit{derivative} of our cost function, i.e. the slope of the line tangential to it. This slope or gradient at a particular point gives us the direction to move towards in order to reach the minimum. The process is repeated iteratively, and on each iteration we refine our values of $\theta_0$ and $\theta_1$ in a stepwise fashion. The rate at which $\theta_0$ and $\theta_1$ are refined is controlled by a parameter $\alpha$, also known as the \textit{learning rate} (see Listing~\ref{lst:gradient-descent}).
\begin{mdframed}
\begin{lstlisting}[caption={The gradient descent algorithm},label={lst:gradient-descent}]
Repeat until convergence {
$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)$ for $j = 0$ and $j = 1$
$(\textup{Intuitively: } \theta_j := \theta_j - \alpha[\textup{Slope of tangent (i.e. derivative)}])$
}
\end{lstlisting}
\end{mdframed}
\subsection{Gradient Descent for Linear Regression}
The algorithm defined in Listing~\ref{lst:gradient-descent} above is a general-purpose one. When applied to the case of linear regression, gradient descent must be appropriately tweaked, and a new form is derived: we substitute the cost function and hypothesis function, and modify the equation to the one shown in Listing~\ref{lst:gradient-descent-linear-regression}. The derivation of these formulas is out of the scope of this course; suffice it to say that they are obtained using the partial derivatives of the cost function with respect to $\theta_0$ and $\theta_1$.
\begin{mdframed}
\begin{lstlisting}[caption={The gradient descent algorithm adapted for linear regression},label={lst:gradient-descent-linear-regression}]
Repeat until convergence {
$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})$
$\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i = 1}^m[(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}]$
}
\end{lstlisting}
\end{mdframed}
Here $m$ is the size of the training set, $\theta_0$ is a constant that is updated simultaneously with $\theta_1$, and $x^{(i)}$ and $y^{(i)}$ are values from the given training set. Note that we have two separate cases for $\theta_j$: one where $j = 0$ and one where $j = 1$. In particular, observe that for $\theta_1$ we multiply by $x^{(i)}$ at the end, due to the partial derivative.
Using this gradient descent algorithm we are in a position to choose a guess for our hypothesis, and then, by repeatedly applying gradient descent steps, refine our guess until the overall cost is at its minimum (see Figure~\ref{fig:gradient-descent}). In doing so, our hypothesis with the chosen values of $\theta_0$ and $\theta_1$ will become gradually more accurate.
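The update rules in Listing~\ref{lst:gradient-descent-linear-regression} translate almost line for line into code. The following Python sketch (our own illustration; the function name, the choice of $\alpha = 0.1$, and the fixed iteration count are assumptions, not part of the course material) runs them on the small training set from the hypothesis example:

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Batch gradient descent for univariate linear regression."""
    theta0, theta1 = 0.0, 0.0          # initial guess
    m = len(xs)
    for _ in range(iterations):
        # prediction errors h(x^(i)) - y^(i) under the current parameters
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # simultaneous update: both gradients use the old theta values
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]
t0, t1 = gradient_descent(xs, ys)
print(t0, t1)  # converges towards the least-squares fit theta0 = 4.7, theta1 = 1.2
```

Note that both gradients are computed from the same, old parameter values before either parameter is updated; updating $\theta_0$ first and then reusing it when updating $\theta_1$ would be a subtle bug.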
\subsection{What's Next?}
Instead of using linear regression with one input variable, we'll generalize and expand our concepts so that we can predict data with multiple input variables. Also, we'll solve for $\theta_0$ and $\theta_1$ exactly, without the need for iterative algorithms like gradient descent.
\section{Linear Algebra Review}
\subsection{Matrices and Vectors}
Matrices are two dimensional arrays like so:
$$
\begin{bmatrix}
a & b & c\\
d & e & f\\
g & h & i\\
j & k & l
\end{bmatrix}
$$
The above matrix has four rows and three columns, so it's a $4 \times 3$ matrix. A vector is a special kind of matrix with many rows but just one column:
$$
\begin{bmatrix}
w\\
x\\
y\\
z
\end{bmatrix}
$$
\subsubsection{Notation and terms}
\begin{itemize}
\item $A_{ij}$ refers to the element in the $i$th row and $j$th column of matrix $A$.
\item A vector with $n$ rows is referred to as an $n$-dimensional vector.
\item $v_i$ refers to the element in the $i$th row of the vector.
\item In general, all vectors and matrices used in this text will be 1-indexed.
\item Matrices are usually denoted by uppercase names, while vectors by lower case names.
\item \textit{Scalar} means that an object is a single value, not a vector or matrix.
\item $\mathbb{R}$ refers to the set of scalar real numbers.
\item $\mathbb{R}^n$ refers to the set of $n$-dimensional vectors of real numbers.
\item $\mathbb{R}^{n\times m}$ refers to the set of matrices of real numbers of dimensions $n \times m$.
\end{itemize}
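The 1-indexed notation above differs from most programming languages, which index from 0. A small Python sketch (our own illustration) makes the correspondence explicit for the $4 \times 3$ matrix shown earlier:

```python
# The 4x3 matrix from above, stored as a list of rows.
# A_ij in the 1-indexed notation corresponds to A[i-1][j-1] in 0-indexed Python.
A = [["a", "b", "c"],
     ["d", "e", "f"],
     ["g", "h", "i"],
     ["j", "k", "l"]]

rows, cols = len(A), len(A[0])
print(rows, cols)   # 4 3, i.e. a 4x3 matrix
print(A[1][2])      # A_23 in the notation above, i.e. "f"
```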
\subsection{Addition and Scalar Multiplication}
Addition and subtraction are \textbf{element-wise} operations, so we simply add or subtract each corresponding element. To be able to do so, the dimensions of the matrices participating in the addition or subtraction must be the same:
$$
\begin{bmatrix}
a & b\\
c & d
\end{bmatrix}
+
\begin{bmatrix}
w & x\\
y & z
\end{bmatrix}
=
\begin{bmatrix}
a+w & b+x\\
c+y & d+z
\end{bmatrix}
$$
In scalar multiplication, we simply multiply every element by the scalar value:
$$
x \times
\begin{bmatrix}
a & b\\
c & d
\end{bmatrix}
=
\begin{bmatrix}
x \times a & x \times b\\
x \times c & x \times d
\end{bmatrix}
$$
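Both operations are straightforward to implement. The following Python sketch (our own illustration; the helper names are hypothetical) performs element-wise addition and scalar multiplication on small matrices stored as lists of rows:

```python
def mat_add(A, B):
    """Element-wise sum; A and B must have the same dimensions."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scalar_mult(s, A):
    """Multiply every element of A by the scalar s."""
    return [[s * a for a in row] for row in A]

print(mat_add([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[6, 8], [10, 12]]
print(scalar_mult(2, [[1, 2], [3, 4]]))             # [[2, 4], [6, 8]]
```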
\subsection{Matrix-Vector Multiplication}
We map the column of the vector onto each row of the matrix, multiplying each element and summing the result:
$$
\begin{bmatrix}
a & b\\
c & d\\
e & f
\end{bmatrix}
\times
\begin{bmatrix}
x\\
y
\end{bmatrix}
=
\begin{bmatrix}
a \times x + b \times y\\
c \times x + d \times y\\
e \times x + f \times y
\end{bmatrix}
$$
The result is a \textit{vector}. The vector must be the \textbf{second} term of the multiplication. Also, the number of \textit{rows} of the vector must equal the number of \textit{columns} of the matrix. Thus, an $n \times m$ matrix multiplied by an $m \times 1$ vector results in an $n \times 1$ vector.
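The row-by-row mapping above is one loop in code. A Python sketch (our own illustration; the helper name is hypothetical):

```python
def mat_vec(A, v):
    """Multiply an n x m matrix A by an m-dimensional vector v -> n-dimensional vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# A 3x2 matrix times a 2x1 vector gives a 3x1 vector.
print(mat_vec([[1, 2], [3, 4], [5, 6]], [7, 8]))  # [23, 53, 83]
```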
\subsection{Matrix-Matrix Multiplication}
We multiply two matrices by breaking it into several vector multiplications and concatenating the result:
$$
\begin{bmatrix}
a & b\\
c & d\\
e & f\\
\end{bmatrix}
\times
\begin{bmatrix}
w & x\\
y & z\\
\end{bmatrix}
=
\begin{bmatrix}
a \times w + b \times y & a \times x + b \times z\\
c \times w + d \times y & c \times x + d \times z\\
e \times w + f \times y & e \times x + f \times z
\end{bmatrix}
$$
An $n \times m$ matrix multiplied by an $m \times o$ matrix results in an $n \times o$ matrix. In the above example, a $3 \times 2$ matrix multiplied by a $2 \times 2$ matrix resulted in a $3 \times 2$ matrix. To multiply two matrices, the number of \textit{columns} of the first matrix must match the number of \textit{rows} of the second matrix.
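The ``several vector multiplications'' view translates directly into code: each entry of the result pairs a row of the first matrix with a column of the second. A Python sketch (our own illustration; the helper name is hypothetical):

```python
def mat_mult(A, B):
    """n x m times m x o -> n x o; zip(*B) iterates over the columns of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
B = [[7, 8], [9, 10]]          # 2 x 2
print(mat_mult(A, B))          # 3 x 2 result: [[25, 28], [57, 64], [89, 100]]
```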
\subsection{Matrix Multiplication Properties}
The matrix multiplication operator has the following properties:
\begin{enumerate}
\item Not commutative, i.e. $A \times B \neq B \times A$.
\item Associative, i.e. $(A \times B) \times C = A \times (B \times C)$.
\end{enumerate}
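Both properties are easy to confirm numerically. The following Python sketch (our own illustration, reusing a hypothetical \texttt{mat\_mult} helper) shows one pair of square matrices for which $A \times B \neq B \times A$, while associativity holds:

```python
def mat_mult(A, B):
    """Matrix product; zip(*B) iterates over the columns of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 2]]

print(mat_mult(A, B) != mat_mult(B, A))                             # not commutative
print(mat_mult(mat_mult(A, B), C) == mat_mult(A, mat_mult(B, C)))   # associative
```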
The \textit{identity matrix}, when multiplied by any compatible matrix, results in the original matrix. It's just like multiplying numbers by 1. The identity matrix has 1's on the diagonal and 0's elsewhere:
$$
\begin{bmatrix}
1 & 0 & \cdots & 0\\
0 & 1 & 0 & 0\\
\vdots & 0 & \ddots & 0\\
0 & \cdots & 0 & 1
\end{bmatrix}
$$
When multiplying the identity matrix after some matrix, the square identity matrix should match the other matrix's \textbf{columns}; when multiplying the identity matrix before some other matrix, the square identity matrix must match the matrix's \textbf{rows}.
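The dimension rule above can be checked with a non-square matrix. A Python sketch (our own illustration; the helper names are hypothetical):

```python
def identity(n):
    """n x n identity matrix: 1's on the diagonal, 0's elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def mat_mult(A, B):
    """Matrix product; zip(*B) iterates over the columns of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2], [3, 4], [5, 6]]            # 3 x 2
print(mat_mult(A, identity(2)) == A)    # A * I_2 = A: I matches A's columns
print(mat_mult(identity(3), A) == A)    # I_3 * A = A: I matches A's rows
```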
\subsection{Inverse and Transpose}
The \textbf{inverse} of a matrix $A$ is denoted by $A^{-1}$. Multiplying a matrix by its inverse yields the identity matrix. A non-square matrix does not have an inverse. The inverse of matrices in Octave is computed using the \texttt{pinv(A)} function.
The \textbf{transpose} of a matrix $A$ is denoted by $A^T$. Transposing is like rotating the matrix $90^{\circ}$ clockwise and then reversing it; that is, columns in the original matrix become rows in the transposed matrix. In other words, $A_{ij} = A^T_{ji}$:
$$
A =
\begin{bmatrix}
a & b\\
c & d\\
e & f
\end{bmatrix} \qquad
A^T =
\begin{bmatrix}
a & c & e\\
b & d & f
\end{bmatrix}
$$
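To close the review, both operations can be sketched in a few lines of Python (our own illustration; the helper names are hypothetical, and the inverse is written out only for the $2 \times 2$ case under the assumption that the determinant is non-zero, whereas Octave's \texttt{pinv} handles the general case):

```python
def transpose(A):
    """Columns of A become rows of A^T."""
    return [list(col) for col in zip(*A)]

def inverse_2x2(A):
    """Closed-form inverse of a 2x2 matrix; assumes a non-zero determinant."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det],
            [-c / det, a / det]]

print(transpose([[1, 2], [3, 4], [5, 6]]))  # [[1, 3, 5], [2, 4, 6]]
print(inverse_2x2([[4, 7], [2, 6]]))        # multiplying [[4, 7], [2, 6]] by this yields I
```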
\end{document}