\documentclass[10pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\usepackage[round]{natbib}
\let\cite\citep
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\begin{document}
\title{Getting Started with~Machine Learning}
\author[1]{Shirin Mojarad}%
\affil[1]{Affiliation not available}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\sloppy
Chapter 1.1, \emph{Introduction}
\begin{itemize}
\tightlist
\item
  What is machine learning
\item
  The different types of machine learning
\end{itemize}
\par\null
Chapter 1.2, \emph{Machine Learning Technical Overview}
Chapter 1.3, \emph{Hands-On Machine Learning with Scikit Learn}
Chapter 1.4, \emph{Advanced Topics/Flavors of Machine Learning}
\emph{Appendix: Mathematical Interlude}
\par\null
Section 1, \emph{Getting Started with Machine Learning},
discusses the general framework for a machine learning project. In this
section, we will learn how to systematically approach a machine learning
problem. In addition, this section introduces common machine learning
terminology, such as training and evaluating a model, and splitting data
into train and test sets.
\par\null
Section 2, \emph{Setting Up the Environment}, introduces Jupyter
notebooks for creating and sharing your code. Jupyter is an open-source
web application that allows you to create and share documents that
contain live code, visualizations, and narrative text. We will use the
Google Colab environment for this purpose. Google Colab is a free cloud
service that provides free GPU access. We will introduce the concept of
GPUs in later chapters. You can develop deep learning applications using
popular libraries such as Keras and TensorFlow without installing Python
or any libraries on your local computer. This chapter also includes a
sample notebook containing a simple machine learning model, to get
familiar with creating models and testing them using Jupyter notebooks
in the Google Colab environment.
\par\null
\subsection*{Introduction}
{\label{785190}}
\subsubsection*{What is Machine Learning?}\label{what-is-machine-learning}
Machine learning is simply computers learning from and making
predictions on data using algorithms. The outcome of this learning is a
model: a function or collection of functions capturing the patterns in
the data, which enables the computer to make predictions or take
actions. This implies that humans enable computers to learn without
explicitly programming them {[}1{]}. We need machine learning
algorithms to build models that represent data and explain the
variations and patterns in it. Hence, algorithms are central to machine
learning for extracting knowledge and insight from data.
\par\null
The term machine learning was coined in 1959 by Arthur Samuel, an
American pioneer in the field of artificial intelligence. His
checkers-playing program is considered the first example of a machine
that learned not by being explicitly programmed, but by seeing examples
of previous checkers moves.
\par\null
\subsubsection*{Types of Machine Learning
Algorithms}\label{types-of-machine-learning-algorithms}
There are different types of ML algorithms, based on how they are
created and their applications. Understanding the different types of ML
helps you get a big picture of AI, understand the goal of creating ML
models, and enables you to break down a real problem and design a
machine learning system.
There are some variations in how the types of machine learning
algorithms are defined, but they are commonly divided into categories
according to their purpose. The main categories are the following:
\begin{itemize}
\tightlist
\item
Supervised learning
\item
Unsupervised Learning
\item
Reinforcement Learning
\end{itemize}
\par\null
\subsubsection*{Supervised learning}\label{supervised-learning}
How it works: Supervised learning is the simplest form of learning,
where the ML algorithm is presented with labeled data (data with known
outcomes). The algorithm uses these data, together with their outcomes,
to adjust and refine its parameters. It is then tested on data whose
outcomes it has not seen, and its predictions are compared to the actual
outcomes to assess accuracy. In supervised learning, we model
relationships between the input features and the target outcome such
that we can predict outcome values for new data based on the
relationships the algorithm learnt from previous data. The supervised
learning process is similar to function approximation: using an
algorithm, we pick the function that best describes the input data.
However, oftentimes we are not able to find the exact function
describing the data. Also, algorithms rely upon assumptions made by
humans about how the computer should learn. These assumptions and model
inaccuracy introduce bias, which we will cover in section xxx.
Common algorithms: The main types of supervised learning problems
include regression and classification problems. Some common supervised
learning algorithms include nearest neighbor, Naive Bayes, decision
trees, linear regression, support vector machines, and deep learning.
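To make this concrete, here is a minimal supervised-learning sketch in Python: a nearest-neighbor classifier trained on a tiny hand-made labeled dataset (the points and class labels below are hypothetical, purely for illustration).

```python
# Supervised learning sketch: learn from labeled examples, then predict
# outcomes for unseen points (hypothetical toy data).
from sklearn.neighbors import KNeighborsClassifier

# Each row is a labeled example: two features, with a known outcome in y.
X = [[0, 0], [0, 1], [10, 10], [10, 11]]
y = [0, 0, 1, 1]

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)                            # adjust the model using labeled data

# Predict outcomes for points the model has not seen.
print(model.predict([[0, 2], [9, 10]]))    # → [0 1]
```

Each new point is assigned the outcome of its nearest labeled example, which is exactly the "learn relationships from labeled data, then predict on unseen data" pattern described above.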
\par\null
\subsubsection*{Unsupervised Learning}\label{unsupervised-learning}
How it works: In unsupervised learning, the algorithm learns certain
patterns or behaviors in the data without seeing labeled examples, only
by finding similarities between data points. This is specifically
helpful when one needs to find similar groups in the data. Since there
is no labeled data, these algorithms are helpful when the human expert
does not necessarily know what to look for in the data. This family of
ML algorithms is typically used in pattern recognition and clustering.
Since there is no target outcome against which the algorithm can model
relationships, these algorithms use techniques to mine the data for
rules, patterns, and structures, and group data points based on similar
characteristics.
Common algorithms: The main types of unsupervised learning algorithms
include clustering and association rule mining algorithms. Some common
unsupervised algorithms include k-means clustering, hierarchical
clustering, and Boltzmann machines.
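As a minimal sketch of this idea, the snippet below runs k-means clustering on a handful of unlabeled points (made-up data for illustration): the algorithm groups them purely by similarity, with no outcomes provided.

```python
# Unsupervised learning sketch: group unlabeled points by similarity.
import numpy as np
from sklearn.cluster import KMeans

# Six points with no labels; two groups are visible in the feature space.
X = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [8, 9], [9, 8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster assignments found without any labeled outcomes
```

The first three points receive one cluster label and the last three the other; which numeric id each group gets is arbitrary, since there is no ground truth to anchor the labels.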
\par\null\par\null
\subsubsection*{Reinforcement Learning}\label{reinforcement-learning}
How it works: In reinforcement learning, the computer learns certain
behaviors based on feedback from the environment. The behavior can be
learnt once, or the algorithm can keep adapting using continuous
feedback. Reinforcement learning overcomes some disadvantages of
supervised learning, but is more complex and computationally expensive.
Common algorithms: Examples of reinforcement learning algorithms include
Q-learning and Deep Q-Networks (DQN). Some applications of reinforcement
learning algorithms include computer-played board games (such as chess
and Go), robotic hands, and self-driving cars.
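As a minimal sketch of Q-learning, the snippet below trains an agent on a made-up five-cell corridor (a hypothetical environment, not taken from the text): the agent is rewarded only on reaching the rightmost cell, and feedback from the environment gradually shapes its behavior.

```python
# Tabular Q-learning sketch on a hypothetical 5-cell corridor: the agent
# starts in cell 0 and gets a reward of 1 only on reaching cell 4.
import random
random.seed(0)

n_states = 5
actions = [-1, 1]                         # step left or step right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2     # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != 4:
        # epsilon-greedy: mostly exploit current estimates, sometimes explore
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == 4 else 0.0
        # feedback from the environment refines the value estimates
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

# The greedy policy learned from feedback: it should move right (+1) everywhere.
print([max(actions, key=lambda act: Q[(s, act)]) for s in range(4)])
```

Note that no labeled examples appear anywhere: the only learning signal is the reward the environment returns after each action.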
\par\null
\subsubsection*{Building a simple machine learning
model}\label{building-a-simple-machine-learning-model}
Building a machine learning model involves some high-level steps that
need to be taken in order to design and validate the model. The main
steps in building a general machine learning model include:
\begin{enumerate}
\tightlist
\item
  Split the data into training and test sets
\item
  Build and train a machine learning model using the training data
\item
  Evaluate the model on the test data
\end{enumerate}
In addition to the above steps, building a machine learning model in
Python involves some specific steps, including importing the required
Python libraries, and some steps that need to be taken before building
the model, such as importing and cleaning your data. In the next
section, we will walk through a simple machine learning problem and how
to approach it in Python using the above steps.
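The three steps above can be sketched end to end in Python; the snippet below uses a small synthetic dataset and a linear model (the data and the choice of model here are illustrative assumptions, not the housing data used later).

```python
# End-to-end sketch of the three steps: split, train, evaluate.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 rows, 3 attributes, and an outcome with a known
# linear relationship to the attributes (made up for this sketch).
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.5

# 1. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# 2. Build and train a model using the training data
model = LinearRegression().fit(X_train, y_train)

# 3. Evaluate the model on the held-out test data (R^2; 1.0 is a perfect fit)
print(round(model.score(X_test, y_test), 2))
```

Because the held-out test rows were never seen during training, the final score is an honest estimate of how the model would perform on new data.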
\subsection*{Setting up the
environment}
{\label{760857}}
We use the Google Colab environment to run our Jupyter notebooks. It
also provides GPUs, but since we start with a simple model, there is no
need to use a GPU yet. GPUs make running models much faster; we will
talk more about GPUs in the Advanced Topics chapter.
The notebook is available on GitHub, a platform to share code with
collaborators privately or publicly. THIS IS THE GITHUB LINK
\par\null
\subsection*{House Sales in King County,
USA}\label{house-sales-in-king-county-usa}
Throughout this chapter, we will be working on a common problem in
machine learning: prediction. For this purpose, we have chosen the
Housing Price Prediction dataset from Kaggle, for several reasons. First
of all, most worked examples of deep learning are image and character
recognition. Although image recognition is an important problem to
solve, it is not the most common problem in applied machine learning:
most businesses deal with datasets comprised of several attributes for
each subject. In addition, prediction is a very common problem in
machine learning. This is a great dataset for evaluating simple machine
learning algorithms such as linear regression, but it can also be used
to leverage more complex algorithms such as deep learning.
You can access the data on the Kaggle
website:~\url{https://www.kaggle.com/harlfoxem/housesalesprediction/data}.
This dataset contains house sale prices for King County, which includes
Seattle. It includes homes sold between May 2014 and May 2015. For each
house, the data includes attributes such as the date and price at which
the house was sold; the number of bedrooms and bathrooms; the square
footage of the home, basement, and lot; the number of floors; whether it
has a waterfront view; whether it has been viewed; the overall condition
and overall grade given to the housing unit, based on the King County
grading system; the years built and renovated; zip code, longitude, and
latitude; and the living room area and lot size area in 2015. The
problem we will be solving here is to predict the housing price using
the labeled data. Hence, this is a supervised learning problem, where we
use part of the data to build and train our model and keep part of the
data unseen by the model for testing purposes.
\textbf{Python Code}
Let's start by importing some basic libraries in Python.
\texttt{import\ numpy\ as\ np}
\texttt{import\ pandas\ as\ pd}
\texttt{import\ matplotlib.pyplot\ as\ plt}
\texttt{from\ sklearn\ import\ linear\_model}
Load the data into your~notebook using the url for the data uploaded to
the book's GitHub page. You can download the original dataset from
\href{https://www.kaggle.com/harlfoxem/housesalesprediction/data}{Kaggle}.~
\texttt{url\ =\ \textquotesingle{}}\href{https://raw.githubusercontent.com/Shirinn/House-Price-Prediction/master/kc_house_data.csv}{\texttt{https://raw.githubusercontent.com/Shirinn/House-Price-Prediction/master/kc\_house\_data.csv}}\texttt{\textquotesingle{}}
\texttt{df\ =\ pd.read\_csv(url)}
Then transform dates into year, month and day and select the columns we
are going to use for this prediction:
\texttt{df{[}\textquotesingle{}sale\_yr\textquotesingle{}{]}\ =\ pd.to\_numeric(df.date.str.slice(0,\ 4))}
\texttt{df{[}\textquotesingle{}sale\_month\textquotesingle{}{]}\ =\ pd.to\_numeric(df.date.str.slice(4,\ 6))}
\texttt{df{[}\textquotesingle{}sale\_day\textquotesingle{}{]}\ =\ pd.to\_numeric(df.date.str.slice(6,\ 8))}
\texttt{mydata\ =\ pd.DataFrame(df,\ columns={[}}
\texttt{\textquotesingle{}sale\_yr\textquotesingle{},\textquotesingle{}sale\_month\textquotesingle{},\textquotesingle{}sale\_day\textquotesingle{},}
\texttt{\textquotesingle{}bedrooms\textquotesingle{},\textquotesingle{}bathrooms\textquotesingle{},\textquotesingle{}sqft\_living\textquotesingle{},\textquotesingle{}sqft\_lot\textquotesingle{},\textquotesingle{}floors\textquotesingle{},}
\texttt{\textquotesingle{}condition\textquotesingle{},\textquotesingle{}grade\textquotesingle{},\textquotesingle{}sqft\_above\textquotesingle{},\textquotesingle{}sqft\_basement\textquotesingle{},\textquotesingle{}yr\_built\textquotesingle{},}
\texttt{\textquotesingle{}zipcode\textquotesingle{},\textquotesingle{}lat\textquotesingle{},\textquotesingle{}long\textquotesingle{},\textquotesingle{}sqft\_living15\textquotesingle{},\textquotesingle{}sqft\_lot15\textquotesingle{}{]})}
\texttt{outcome\ =\ \textquotesingle{}price\textquotesingle{}}
You could look into some descriptive statistics for each attribute using
the ``describe'' function in Pandas:
\texttt{mydata.describe()}
\par\null
\subsection*{Linear Regression}\label{linear-regression}
Before getting started with deep learning, let's get started with a
simple machine learning algorithm, to understand the essentials of
building a machine learning model using Python. We will start with the
most commonly used machine learning algorithm, linear regression.
Linear regression is a machine learning algorithm that can show the
relationship between an outcome (dependent variable) and a set of
attributes (independent variables), and how they impact each other. It
essentially shows how the variations in the outcome can be explained by
each of the independent variables. In business, this outcome of interest
could be the sales of a product, pricing, performance, risk, etc.
Independent variables are also referred to as explanatory variables,
since they explain the factors that influence the outcome, along with
the degree of impact, which is represented by the coefficients. These
coefficients are the model parameters and are estimated by training the
model using labeled data. In linear regression, the relationship between
the dependent and independent variables is established by fitting the
best line. As you saw in the previous chapter, a single neuron can act
as a linear regression model, where the best line can be represented by
the equation~\(Y = aX + b\), with~\(Y\)~being the dependent
variable,~\(a\)~the slope of the line,~\(X\)~the independent variable,
and~\(b\)~the intercept.
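As a quick numerical illustration of \(Y = aX + b\) (a made-up example, not the housing data), a least-squares fit on points lying exactly on the line \(y = 2x + 1\) recovers the slope and intercept:

```python
# Fit Y = aX + b to points on a known line and recover a and b.
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X + 1.0                # points on the line y = 2x + 1

a, b = np.polyfit(X, Y, deg=1)   # least-squares fit of a degree-1 polynomial
print(round(a, 2), round(b, 2))  # → 2.0 1.0
```

With real data the points will not fall exactly on a line, so the fitted \(a\) and \(b\) are the values that minimize the overall error instead of reproducing the data perfectly.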
\par\null
\textbf{Python Code}
A Python notebook always starts with importing the required libraries.
We will be using the scikit-learn library to split the data into train
and test datasets and to create a linear regression model:
\par\null
\texttt{from\ sklearn\ import\ linear\_model}
\texttt{from\ sklearn.model\_selection\ import\ train\_test\_split}
\texttt{from\ sklearn.metrics\ import\ mean\_absolute\_error,\ r2\_score}
Next, we define the feature matrix \texttt{X} and the outcome vector
\texttt{y}, then split the data into train and test datasets:
\texttt{X\ =\ mydata}
\texttt{y\ =\ df{[}outcome{]}}
\texttt{X\_train,\ X\_test,\ y\_train,\ y\_test\ =\ train\_test\_split(X,\ y,\ test\_size\ =\ 0.33,\ random\_state=42)}
Then we create a general linear regression object using the imported
library:
\texttt{regr\ =\ linear\_model.LinearRegression()}
Next, we train the model using our training dataset, to estimate the
linear regression parameters that best fit these data:
\texttt{regr.fit(X\_train,\ y\_train)}
And finally, we make predictions on the test data, using the trained
model:
\texttt{y\_pred\ =\ regr.predict(X\_test)}
Let's take a look at the coefficients and the intercept of the trained
model:
\texttt{print(\textquotesingle{}Coefficients:\ \textbackslash{}n\textquotesingle{},\ regr.coef\_)}
\texttt{print(\textquotesingle{}Intercept:\ \textbackslash{}n\textquotesingle{},\ regr.intercept\_)}
The sign of a coefficient shows whether the predictor and the outcome
change in the same direction, and its size shows how strongly the
predictor is correlated with the outcome. For example, the first
coefficient, for sale\_yr (the year the house was sold), shows a strong
direct relationship between more recently sold houses and higher sale
prices.
We can evaluate the model using the mean absolute error (MAE) and the
r-squared value. MAE measures the average deviation of the predictions
from the actual outcomes, without considering the direction of the
error. The r-squared value is an explained variance score, where 1 means
perfect prediction.
\texttt{print(\textquotesingle{}Mean\ Absolute\ Error:\ \%.2f\textquotesingle{}\ \%\ mean\_absolute\_error(y\_test,\ y\_pred))}
\texttt{print(\textquotesingle{}Variance\ score:\ \%.2f\textquotesingle{}\ \%\ r2\_score(y\_test,\ y\_pred))}
In the above results, the MAE value shows that the model estimate is on
average \$132,973.06 off the observed prices. In addition, the variance
score shows that the predictors explain 65\% of the variation in sale
prices.
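To see exactly what these two metrics measure, here is a small check on made-up numbers (not the housing data):

```python
# MAE and r-squared computed on a tiny hand-made example.
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE averages the absolute errors: (10 + 10 + 30) / 3
print(round(mean_absolute_error(y_true, y_pred), 3))   # → 16.667
# r-squared compares the residual error to the variance of y_true:
# 1 - (100 + 100 + 900) / 20000
print(round(r2_score(y_true, y_pred), 3))              # → 0.945
```

Note that MAE is in the same units as the outcome (dollars, in the housing example), while r-squared is unitless, which is why the two are useful together.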
\par\null
\subsection*{Support Vector Machines}\label{support-vector-machines}
\par\null\par\null
\subsection*{Decision Trees}\label{decision-tree}
\par\null\par\null
\subsection*{Ensemble methods and Random
Forest}\label{ensemble-methods-and-random-forest}
\selectlanguage{english}
\FloatBarrier
\end{document}