\documentclass[10pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\usepackage[round]{natbib}
\let\cite\citep
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\begin{document}
\title{Hands On Deep Learning with Keras and TensorFlow}
\author[1]{Alfred Essa}%
\author[1]{Shirin Mojarad}%
\affil[1]{Affiliation not available}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\selectlanguage{english}
\begin{abstract}
In this chapter, we will experiment with deep learning neural networks
using Keras. Keras is a high-level, open-source neural network library
written in Python that can run on top of other deep learning libraries
such as TensorFlow and Theano. The advantage of using Keras over
TensorFlow and Theano directly is that it is easy to learn and use.
Building a deep learning model in Keras is like building a prototype of
an architectural model before launching into the full-scale build.
\par\null
By the end of this chapter, you will be able to build a deep learning
predictive model quickly and easily, even if you are not a neural
network specialist. This chapter is designed for anyone with a minimal
Python programming background. You will use Keras to apply deep learning
to a real-world problem, build both a simple and a deep neural network
to solve it, and learn how a deep learning model can improve on a simple
one.%
\end{abstract}%
\sloppy
\par\null\par\null
\subsection*{Sections and Learning
Objectives}
{\label{260193}}\par\null
Section 1,~\emph{Getting Started with}~\emph{Deep Learning,} discusses
the general framework for a deep learning project. In this section, we
will learn how to approach a deep learning problem systematically. In
addition, this section discusses the similarities and differences
between building a deep learning model and a traditional machine
learning model.
\par\null
Section 2, \emph{Prediction Using Keras}, introduces a dataset and a
problem that requires building a predictive model. In addition, this
section provides step-by-step code and output to help you create your
first deep learning model using Keras.
\par\null
Section 3,~\emph{Classification Using Keras}, shows you how to build a
classification model using Keras. Like Section 2, this section provides
step-by-step code and output.
\par\null
\subsection*{Getting started with deep
learning}
{\label{225724}}
Building a deep learning model using Keras involves the general steps
required for building any other machine learning model, including
loading the data, splitting the data into train and test sets, and
defining, fitting, and evaluating the model. In Keras, there is an extra
step after defining the model, called compiling the model. We need this
step so that Keras can use a backend library such as TensorFlow. The
backend library then chooses the best way to represent the network for
training and prediction on your hardware, whether that is a CPU, a GPU,
or even a distributed cluster.
\par\null
Since compiling the model lets the backend library decide how to train
the network, you need to define the training parameters in this step.
Training the network means finding the best set of weights, such that
the model can make accurate predictions. The training parameters to be
defined include the cost function used to evaluate the weights, and the
optimizer used to search through possible weights and choose the best
(optimal) ones. In addition, there are several optional metrics that you
can collect and report during training.
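To make the roles of the cost function and the optimizer concrete, here is a minimal sketch of gradient descent fitting a single weight by minimizing mean squared error. This is a plain-NumPy illustration, not how Keras implements training; the data and learning rate are made up for the example.

```python
import numpy as np

# Toy data: y is roughly 3 * x; the "model" is a single weight w.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + rng.normal(0, 0.1, 200)

w = 0.0   # initial weight
lr = 0.5  # learning rate
for _ in range(100):
    grad = np.mean(2 * (w * x - y) * x)  # gradient of the MSE cost w.r.t. w
    w -= lr * grad                       # optimizer step toward lower cost
print(round(w, 2))  # close to the true slope of 3
```

The cost function scores a candidate weight, and the optimizer repeatedly nudges the weight in the direction that lowers that score; Keras optimizers such as rmsprop refine this basic idea.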
\textbf{First deep learning model}
\par\null
To build our first deep learning model in Python using Keras, we start
with the basic steps required for any machine learning model in Python.
These include importing the required libraries, loading and cleaning the
data, and splitting the data into train and test sets. Then we define
our neural network model by specifying the input layer, hidden layers,
and output layer, along with their characteristics such as the number of
neurons in each layer and the activation function. In addition to the
usual machine learning steps, there is an extra step in Keras to compile
the model. Compiling the model allows the underlying library, such as
TensorFlow, to find the best way to represent the network for training
and implementation on your hardware. In the following sections, we
explain each step within the context of an example where we build a deep
learning model to predict housing prices.
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/KerasSteps/KerasSteps}
\caption{{General steps to building a deep learning model using Keras.
{\label{392239}}%
}}
\end{center}
\end{figure}
\subsection*{House Price Prediction Using
Keras}\label{house-price-prediction-using-keras}
We will walk you through a hands-on example that applies deep learning
to a real-world dataset. For this exercise, we will be using Jupyter
notebooks, Keras, and TensorFlow.
Unlike most deep learning tutorials, we won't be using handwritten digit
classification on MNIST. Instead, we will use real estate data to walk
you through an applied example of deep learning.
\par\null
\subsection*{House Sales in King County,
USA}\label{house-sales-in-king-county-usa}
Throughout this chapter, we will be working on a common problem in
machine learning: prediction. For this purpose, we have chosen the
Housing Price Prediction dataset from Kaggle, for several reasons. First
of all, most worked examples of deep learning involve image and
character recognition. Although image recognition is an important
problem to solve, it is not the most common kind of problem in machine
learning practice. Most businesses deal with datasets that comprise
several attributes for each subject, and prediction is a very common
problem in machine learning. This is a great dataset for evaluating
simple machine learning algorithms such as linear regression, but also
for leveraging more complex algorithms such as deep learning.
You can access the data on the Kaggle
website:~\url{https://www.kaggle.com/harlfoxem/housesalesprediction/data}.
This dataset contains house sale prices for King County, which includes
Seattle. It covers homes sold between May 2014 and May 2015. For each
house, the data includes attributes such as the date and price at which
the house was sold; the number of bedrooms and bathrooms; the square
footage of the home, basement, and lot; the number of floors; whether it
has a waterfront view; whether it has been viewed; the overall condition
and overall grade given to the housing unit, based on the King County
grading system; the year built and year renovated; the zip code,
longitude, and latitude; and the living room area and lot size area in
2015. The problem we will solve here is to predict the housing price
using the labeled data. Hence, this is a supervised learning problem,
where we use part of the data to build and train our model and keep part
of the data unseen by the model for testing purposes.
\textbf{\emph{1. Import Libraries:~}}As mentioned in previous sections,
a Python notebook always starts with importing the required libraries.
We will use Keras to create a deep learning model. We will also use the
Scikit-Learn library to split the data into train, validation, and test
datasets.
\texttt{import\ numpy\ as\ np}
\texttt{import\ pandas\ as\ pd}
\texttt{from\ sklearn.model\_selection\ import\ train\_test\_split}
\texttt{from\ keras.models\ import\ Sequential}
\texttt{from\ keras.layers\ import\ Dense,\ Activation}
\texttt{from\ keras\ import\ metrics}
\textbf{\emph{2. Load the data}}:~Load the data into your notebook using
the URL for the data uploaded to the book's GitHub page. You can
download the original dataset from
\href{https://www.kaggle.com/harlfoxem/housesalesprediction/data}{Kaggle}:
\texttt{url\ =\ \textquotesingle{}}\href{https://raw.githubusercontent.com/Shirinn/House-Price-Prediction/master/kc_house_data.csv}{\texttt{https://raw.githubusercontent.com/Shirinn/House-Price-Prediction/master/kc\_house\_data.csv}}\texttt{\textquotesingle{}}
\texttt{df\ =\ pd.read\_csv(url)}
Then transform dates into year, month and day and select the columns we
are going to use for this prediction:
\texttt{df{[}\textquotesingle{}sale\_yr\textquotesingle{}{]}\ =\ pd.to\_numeric(df.date.str.slice(0,\ 4))}
\texttt{df{[}\textquotesingle{}sale\_month\textquotesingle{}{]}\ =\ pd.to\_numeric(df.date.str.slice(4,\ 6))}
\texttt{df{[}\textquotesingle{}sale\_day\textquotesingle{}{]}\ =\ pd.to\_numeric(df.date.str.slice(6,\ 8))}
\texttt{mydata\ =\ pd.DataFrame(df,\ columns={[}}
\texttt{\textquotesingle{}sale\_yr\textquotesingle{},\textquotesingle{}sale\_month\textquotesingle{},\textquotesingle{}sale\_day\textquotesingle{},}
\texttt{\textquotesingle{}bedrooms\textquotesingle{},\textquotesingle{}bathrooms\textquotesingle{},\textquotesingle{}sqft\_living\textquotesingle{},\textquotesingle{}sqft\_lot\textquotesingle{},\textquotesingle{}floors\textquotesingle{},}
\texttt{\textquotesingle{}condition\textquotesingle{},\textquotesingle{}grade\textquotesingle{},\textquotesingle{}sqft\_above\textquotesingle{},\textquotesingle{}sqft\_basement\textquotesingle{},\textquotesingle{}yr\_built\textquotesingle{},}
\texttt{\textquotesingle{}zipcode\textquotesingle{},\textquotesingle{}lat\textquotesingle{},\textquotesingle{}long\textquotesingle{},\textquotesingle{}sqft\_living15\textquotesingle{},\textquotesingle{}sqft\_lot15\textquotesingle{}{]})}
\texttt{outcome\ =\ \textquotesingle{}price\textquotesingle{}}
You can look at descriptive statistics for each attribute using the
``describe'' function in Pandas:
\texttt{mydata.describe()}
\par\null
\textbf{\emph{3. Split the data into train, validation, and test
datasets:~}}We use train\_test\_split from Scikit-Learn to split the
data into train and test sets. Then, we split the training set into
training and validation datasets. To achieve a 20\%, 20\%, 60\% split
into test, validation, and training sets, we first split the data into
20\% test and 80\% training, and then split the training data into 25\%
validation and 75\% training.
\par\null
\texttt{X\ =\ mydata.values}
\texttt{y\ =\ df{[}outcome{]}.values}
\texttt{X\_train,\ X\_test,\ y\_train,\ y\_test\ =\ train\_test\_split(X,\ y,\ test\_size=0.2,\ random\_state=42)}
\texttt{X\_train,\ X\_valid,\ y\_train,\ y\_valid\ =\ train\_test\_split(X\_train,\ y\_train,\ test\_size\ =\ 0.25,}
\texttt{random\_state=2018)}
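The arithmetic behind the two-stage split can be checked directly; the row count below is hypothetical:

```python
n = 1000                      # hypothetical number of rows
n_test = int(n * 0.20)        # first split: 20% held out for test
n_rest = n - n_test           # 80% remains
n_valid = int(n_rest * 0.25)  # 25% of the remainder = 20% of the total
n_train = n_rest - n_valid    # 75% of the remainder = 60% of the total
print(n_test, n_valid, n_train)  # 200 200 600
```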
\par\null
\textbf{\emph{4. Define model}}: Deep learning models in Keras are
defined as a sequence of layers. As we learned in previous chapters, any
deep learning model comprises an input layer, one or more hidden layers,
and an output layer. In Keras, we add these layers in sequence, hence
the name Sequential model. For the first layer, we need to make sure
that we have the correct number of inputs. This is normally the number
of attributes in the data and can be found as the second dimension of
the dataset:
\par\null
\texttt{input\_dim\ =\ X\_train.shape{[}1{]}}
\par\null
Since our data has 18 attributes, the input dimension is 18. We do not
need to define the input dimension for the subsequent layers, since the
network can calculate it automatically. The difficult question is how to
decide the number of hidden layers and their parameters, such as the
number of nodes in each layer and the activation function. Deep learning
experts and practitioners suggest a process of trial-and-error
experimentation {[}ref{]}. The network should be large enough to capture
the structure of the problem, and small enough not to overfit the data.
In Keras, fully connected layers are defined using the~\textbf{Dense}
class. You specify the number of neurons in the layer using the first
argument, and the initialization and activation function as subsequent
arguments using~\textbf{init} and~\textbf{activation}. We will discuss
the possible choices of initialization and activation function in the
following sections. For now, let's start with the
'\textbf{uniform}' initialization, which draws the initial weights from
a uniform distribution between -0.05 and 0.05. We use the 'relu'
activation function for both the input and hidden layers in this
model.~
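As an aside, both ingredients are easy to state in plain NumPy. The sketch below only illustrates what uniform initialization and the relu activation compute, not Keras's internal code; the layer sizes mirror the model defined next.

```python
import numpy as np

def relu(z):
    # relu keeps positive pre-activations and zeroes out the rest
    return np.maximum(0.0, z)

rng = np.random.default_rng(42)
W = rng.uniform(-0.05, 0.05, size=(18, 100))  # uniform initial weights
b = np.zeros(100)                             # biases start at zero

x = rng.normal(size=18)  # one example with 18 attributes
h = relu(x @ W + b)      # activations of a 100-neuron hidden layer
print(h.shape, float(h.min()))  # all activations are non-negative
```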
\par\null
\texttt{kmodel\ =\ Sequential()}
\texttt{kmodel.add(Dense(100,\ activation="relu",\ input\_dim\ =\ 18))}
\texttt{kmodel.add(Dense(50,\ activation="relu"))}
\texttt{kmodel.add(Dense(1))}
\texttt{print(kmodel.summary())}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/Screen-Shot-2018-04-20-at-2-12-25-PM/Screen-Shot-2018-04-20-at-2-12-25-PM}
\caption{{Summary of the model architecture printed by \texttt{kmodel.summary()}.
{\label{205432}}%
}}
\end{center}
\end{figure}
\par\null\par\null
\textbf{\emph{5. Compile model:}}
This step lets Keras use the underlying library, such as TensorFlow, to
find the best way to represent the network for training and
implementation on your hardware. For compiling, you must specify some
additional properties for training the network, such as the loss
function and the optimization algorithm (optimizer). In this example, we
will use mean squared error as the loss function,~\textbf{rmsprop} as
the optimizer {[}ref to Geoff Hinton's paper{]}, and mean absolute error
'\textbf{mae}' as the evaluation measure:
\par\null
\texttt{kmodel.compile(loss=\textquotesingle{}mean\_squared\_error\textquotesingle{},}
\texttt{optimizer=\textquotesingle{}rmsprop\textquotesingle{},}
\texttt{metrics={[}metrics.mae{]})}
\par\null
\textbf{\emph{6. Fit/train model:}}
Once we compile the model, it is ready to be trained on data. Remember
that training is the process of tuning the model's weights. Similar to
Scikit-Learn, we train a model using the~\textbf{fit()} function. During
training, the model goes through the data for a fixed number of
iterations. These iterations are called epochs, and the number of epochs
is one of the properties of the training process in deep learning. You
can also specify the number of instances evaluated before the network
weights are updated. This is called the \textbf{batch\_size}.~
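Epochs and batch size together determine how many weight updates occur. The count below uses a hypothetical training-set size:

```python
import math

n_train = 10000  # hypothetical number of training rows
batch_size = 100
epochs = 500

updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)  # 100 50000
```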
\par\null
\texttt{kmodel.fit(X\_train,\ y\_train,\ batch\_size=100,\ epochs=500)}
\par\null
\textbf{\emph{7. Evaluate model:}}
To evaluate the model on new data, we can use the \textbf{evaluate()}
function.~Earlier, in the compiling step, we specified mean absolute
error as the metric to report the prediction error.
\par\null
\texttt{scores\ =\ kmodel.evaluate(X\_test,\ y\_test)}
\texttt{print("MAE:\ "\ +\ str(scores{[}1{]}))}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/Screen-Shot-2018-04-20-at-2-30-31-PM/Screen-Shot-2018-04-20-at-2-30-31-PM}
\caption{{Mean absolute error of the model on the test data.
{\label{835501}}%
}}
\end{center}
\end{figure}
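Mean absolute error is simply the average magnitude of the prediction errors, so it is in the same units as the price. A hand computation on hypothetical prices:

```python
import numpy as np

y_true = np.array([450000.0, 300000.0, 610000.0])  # hypothetical observed prices
y_pred = np.array([430000.0, 350000.0, 600000.0])  # hypothetical predictions

mae = np.mean(np.abs(y_pred - y_true))
print(mae)  # the average of 20,000, 50,000 and 10,000
```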
Comparing the MAE of 170,303.89 to the results from linear regression
(LR), we observe that LR performs better at predicting housing prices.
So why do we need deep learning?
\par\null
\subsection*{Classification using
Keras}
{\label{939467}}
Now that we have built a predictive model using Keras, building a
classification model sounds like a trivial task. Most tutorials consider
a classification model (where the desired outcome to be predicted is
binary or categorical) easier to build than a predictive model (where
the desired outcome is continuous). However, due to the sequential
nature of Keras models, building either kind is equally easy (or
difficult, depending on how you found the previous example). For this
example, we use another Kaggle dataset.
\subsection*{The Otto Group Dataset}
{\label{484057}}
The Otto Group is one of the world's biggest e-commerce companies, and
consistent analysis of the performance of its products is crucial.
However, due to its diverse global infrastructure, many identical
products get classified differently. Kaggle provided a dataset with 93
features for more than 200,000 products. The objective is to build a
predictive model that is able to distinguish between the main product
categories. Each row corresponds to a single product. There are a total
of 93 numerical features, which represent counts of different events.
All features have been obfuscated and will not be defined any further.
You can download the data directly from the Kaggle website:
\url{https://www.kaggle.com/c/otto-group-product-classification-challenge/data}
Like the previous example, we follow the standard steps to build our
Keras model. In this example, we first build a simple perceptron model
with no hidden layers. This model is similar to a logistic regression
model in that the outcome is a linear combination of weighted inputs
plus a bias term.~
\par\null\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/MLP-arch/MLP-arch}
\caption{{A perceptron model (a neural network with no hidden layers) that
represents a logistic regression model.
{\label{306196}}%
}}
\end{center}
\end{figure}
\par\null
\emph{\textbf{1. Import libraries}}: Let's start with importing required
libraries:
\texttt{import\ numpy\ as\ np}
\texttt{import\ pandas\ as\ pd}
\texttt{from\ sklearn.preprocessing\ import\ LabelEncoder}
\texttt{from\ sklearn.model\_selection\ import\ train\_test\_split}
\texttt{from\ sklearn.linear\_model\ import\ LogisticRegression}
\texttt{from\ sklearn\ import\ metrics}
\texttt{import\ keras}
\texttt{from\ keras.utils\ import\ np\_utils}
\texttt{import\ matplotlib.pyplot\ as\ plt}
\texttt{from\ keras.models\ import\ Sequential}
\texttt{from\ keras.layers\ import\ Dense,\ Activation}
\par\null
\textbf{\emph{2. Load the data}}:
\texttt{df\ =\ pd.read\_csv(\textquotesingle{}}\href{https://raw.githubusercontent.com/Shirinn/House-Price-Prediction/master/train_OttoGroup_Kaggle.csv}{\texttt{https://raw.githubusercontent.com/Shirinn/House-Price-Prediction/master/train\_OttoGroup\_Kaggle.csv}}\texttt{\textquotesingle{})}
\par\null
\emph{\textbf{3. Data Preparation:~}}For data preparation, we use
one-hot encoding to convert the target into a vector that is all zeros
except for a 1 at the index corresponding to the class of the sample.
\par\null
\texttt{df\_array\ =\ df.values}
\texttt{X\ =\ df\_array{[}:,\ 1:-1{]}.astype(np.float32)}
\texttt{labels\ =\ df\_array{[}:,\ -1{]}}
\texttt{print(np.unique(labels))}
\texttt{encoder\ =\ LabelEncoder()}
\texttt{encoder.fit(labels)}
\texttt{y\ =\ encoder.transform(labels).astype(np.int32)}
\texttt{Y\ =\ np\_utils.to\_categorical(y)}
\texttt{X\_train,\ X\_test,\ y\_train,\ y\_test\ =\ train\_test\_split(X,\ Y,\ test\_size=.33,random\_state=2018)}
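To see what the LabelEncoder plus to\_categorical pipeline produces, here is a self-contained NumPy sketch of one-hot encoding; the labels are hypothetical but follow the Otto naming scheme:

```python
import numpy as np

labels = np.array(['Class_2', 'Class_1', 'Class_3', 'Class_2'])  # hypothetical

classes = np.unique(labels)                 # sorted unique class names
indices = np.searchsorted(classes, labels)  # integer index per sample
one_hot = np.eye(len(classes))[indices]     # one identity row per sample
print(one_hot)  # each row has a single 1 at its class index
```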
\par\null
\textbf{\emph{4. Define model}}: First we look at the input dimension
(i.e., the number of attributes) and how many categories we are
predicting (the output dimension):
\par\null
\texttt{dims\ =\ X\_train.shape{[}1{]}}
\texttt{print(dims,\ \textquotesingle{}dims\textquotesingle{})}
\texttt{nb\_classes\ =\ y\_train.shape{[}1{]}}
\texttt{print(nb\_classes,\ \textquotesingle{}classes\textquotesingle{})}
\par\null
As in the previous example, we define the model as a sequence of layers,
where we use Keras's Sequential model as a container for the layers:
\par\null
\texttt{kmodel\ =\ Sequential()}
\texttt{kmodel.add(Dense(nb\_classes,\ input\_shape=(dims,),\ activation=\textquotesingle{}sigmoid\textquotesingle{}))}
\texttt{kmodel.add(Activation(\textquotesingle{}softmax\textquotesingle{}))}
\texttt{kmodel.summary()}
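The final softmax layer turns the raw outputs into probabilities that sum to one, with the largest score receiving the largest probability. A minimal sketch of what softmax computes (an illustration, not Keras's implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw outputs for 3 classes
probs = softmax(scores)
print(probs.round(3), float(probs.sum()))  # probabilities summing to 1
```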
\par\null
\textbf{\emph{5. Compile model:}}
For compiling, you must specify some additional properties for training
the network, such as the loss function and the optimization algorithm
(optimizer). In this example, we will use categorical cross-entropy as
the loss function, stochastic gradient descent '\textbf{sgd}' as the
optimizer {[}ref to Geoff Hinton's paper{]}, and accuracy '\textbf{acc}'
as the evaluation measure:
\texttt{kmodel.compile(optimizer=\textquotesingle{}sgd\textquotesingle{},\ loss=\textquotesingle{}categorical\_crossentropy\textquotesingle{},\ metrics={[}\textquotesingle{}acc\textquotesingle{}{]})}
\par\null
\emph{\textbf{6. Fit/train model}:~}Similar to the previous example, we
use the training data to train the model and find the model parameters,
such that the predicted outcome is as close as possible to the desired
outcome:
\par\null
\texttt{kmodel.fit(X\_train,\ y\_train)}
\par\null
\textbf{\emph{7. Evaluate Model}}:
\par\null
\texttt{scores\ =\ kmodel.evaluate(X\_test,\ y\_test)}
\texttt{print(\textquotesingle{}Accuracy:\ \textquotesingle{}\ +\ str(scores{[}1{]}*100)\ +\ \textquotesingle{}\%\textquotesingle{})}
\par\null\par\null
\subsection*{Putting it together}
{\label{574357}}
All of the above steps can be summarized in a few lines of code. This
easy and fast implementation is the most conspicuous characteristic of
Keras, which has made it a mainstream tool for researchers implementing
deep learning models. Fast and easy implementation gives researchers a
quick experimentation cycle, in which they can decide whether deep
learning is a good option for their application.
\par\null
\texttt{kmodel\ =\ Sequential()}
\texttt{kmodel.add(Dense(nb\_classes,\ input\_shape=(dims,),\ activation=\textquotesingle{}sigmoid\textquotesingle{}))}
\texttt{kmodel.add(Activation(\textquotesingle{}softmax\textquotesingle{}))}
\texttt{kmodel.compile(optimizer=\textquotesingle{}sgd\textquotesingle{},\ loss=\textquotesingle{}categorical\_crossentropy\textquotesingle{},\ metrics={[}\textquotesingle{}acc\textquotesingle{}{]})}
\texttt{kmodel.fit(X\_train,\ y\_train)}
\texttt{scores\ =\ kmodel.evaluate(X\_test,\ y\_test)}
\texttt{print(\textquotesingle{}Accuracy:\ \textquotesingle{}\ +\ str(scores{[}1{]}*100)\ +\ \textquotesingle{}\%\textquotesingle{})}
\par\null\par\null
\section*{Deep Learning
Hyperparameters}
{\label{460266}}
We have talked about neural network parameters, including weights and
biases. Parameters are the numbers that the machine learning algorithm
learns during the learning process. For example, in logistic and linear
regression, the parameters are the coefficients that the algorithm
learns. Hyperparameters are the knobs and numbers that you, as the
human, control. Some obvious examples of hyperparameters come from
regularization: the L1 and L2 penalties and the dropout rate, both the
values chosen for them and the mere decision of whether to use them at
all. In neural networks, the number of hidden layers, the number of
neurons in each layer, the activation function, the cost function, the
network optimizer, the metrics used to evaluate model goodness, and any
other knob that you as the human specify about the algorithm are
hyperparameters. In the following sections, you will learn the role of
each of these properties and your options for each of them.
\par\null
\subsection*{Learning Rate}
{\label{380451}}
In gradient descent algorithms, the learning rate indicates how ``big''
the steps taken in the direction of the gradient are. In other words,
the learning rate determines how quickly the algorithm moves in the
direction of the gradient.
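The effect of the learning rate is easy to see on a one-dimensional cost, f(w) = w\textsuperscript{2}, whose gradient is 2w. A modest rate converges; an oversized one overshoots the minimum and diverges. The rates below are chosen only for illustration:

```python
def descend(lr, steps=20, w=10.0):
    # plain gradient descent on f(w) = w**2, whose gradient is 2*w
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(0.1))  # small steps: heads steadily toward the minimum at 0
print(descend(1.1))  # steps too big: each update overshoots, and w blows up
```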
\par\null
\subsection*{Optimizer}
{\label{690027}}\par\null
Backpropagation is the core algorithm behind how neural networks learn.
Let's first walk through intuitively what the algorithm actually does,
without any reference to the mathematics behind it. For those of you who
do want to know how the math works, there is an appendix that goes
through the calculus underlying backpropagation. So far we have learned
how neural networks feed information forward to predict an outcome. In
our housing price example, each house's characteristics, such as the
year built and the year renovated, are fed into the first layer of the
network. What we mean by learning is finding the weights and biases that
minimize a certain cost function. As a quick reminder, to compute the
cost of a single training example, you take the output that the network
gives, along with the observed output, and add up the squares of the
differences between their components. Doing this for tens of thousands
of examples and averaging the results gives the total cost of the
network. Then, we look for the negative gradient of this cost function,
which tells the network how to change the weights and biases of all
connections so as to decrease the cost most efficiently.
Backpropagation is an algorithm for computing this gradient. In effect,
the cost function is optimized for each training example, and therefore
each training example has an effect on how the weights and biases are
adjusted.~
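The chain-rule bookkeeping can be shown on the smallest possible network: one input, one relu hidden unit, and one linear output, trained one example at a time on a made-up target. This is a sketch of the idea, not a production implementation:

```python
import numpy as np

# Network: x -> h = relu(w1*x + b1) -> yhat = w2*h + b2, cost = (yhat - y)**2
w1, b1, w2, b2 = 0.5, 0.0, 0.5, 0.0
lr = 0.05
rng = np.random.default_rng(1)
for _ in range(2000):
    x = rng.uniform(0, 1)
    y = 2.0 * x                          # made-up target: slope 2
    z = w1 * x + b1                      # forward pass
    h = max(z, 0.0)
    yhat = w2 * h + b2
    d_yhat = 2 * (yhat - y)              # d(cost)/d(yhat)
    d_w2, d_b2 = d_yhat * h, d_yhat      # output-layer gradients
    d_z = d_yhat * w2 if z > 0 else 0.0  # relu passes gradient only when z > 0
    d_w1, d_b1 = d_z * x, d_z            # hidden-layer gradients (chain rule)
    w1 -= lr * d_w1; b1 -= lr * d_b1     # per-example update, as described above
    w2 -= lr * d_w2; b2 -= lr * d_b2
print(round(w1 * w2, 2))  # effective slope of the network, near 2
```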
\par\null
\subsection*{Metrics}
The Mean Squared Error (MSE) is a non-negative number where values
closer to zero represent a smaller error. The Root Mean Squared Error
(RMSE) is the square root of the MSE. The RMSE is easier to interpret
than the MSE, as it represents the average deviation of the model's
estimates from the observed outcome, in the same units as the outcome.
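Both metrics are one-liners; the values here are hypothetical:

```python
import math

y_true = [3.0, 5.0, 2.0, 7.0]  # hypothetical observed values
y_pred = [2.5, 5.0, 4.0, 8.0]  # hypothetical model estimates

mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)
rmse = math.sqrt(mse)          # back in the units of the outcome
print(mse, round(rmse, 3))     # 1.3125 1.146
```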
\subsection*{Overfitting}
{\label{690027}}\par\null
Once a neural network's weights and biases are adjusted to improve its
performance on the training data, we hope that what it has learned
generalizes well beyond the training data. The way we test that is,
after training the network, to show it labeled data that it has never
seen before and measure how accurately it classifies or predicts the
outcome.~
\par\null
\subsection*{How to Choose
Hyperparameters?}
{\label{798592}}\par\null
There are no rules of thumb for choosing hyperparameters. However, most
hyperparameters have a default value, which is a good starting point for
creating the model. After getting started with the default
hyperparameters, you then search for a better hyperparameter
combination. Different techniques are available for finding a better
combination of hyperparameters, such as grid search, random search, and
Bayesian optimization. Sometimes there are more hyperparameters to be
set by the human than parameters for the algorithm to learn. So, what is
the magic in deep learning if we have to find the optimal combination of
hyperparameters for the data at hand? Removing hyperparameters has been
a long-standing goal of the machine learning community. In this regard,
deep learning has helped remove two important hyperparameter choices:
model selection and feature engineering.
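Grid search is the simplest of these techniques: evaluate every combination on the validation set and keep the best. The sketch below substitutes a made-up scoring function for the expensive train-and-validate step:

```python
import itertools

def validation_error(lr, n_hidden):
    # hypothetical stand-in for training a model and scoring it on validation data
    return (lr - 0.01) ** 2 + (n_hidden - 50) ** 2 / 10000.0

learning_rates = [0.001, 0.01, 0.1]
hidden_sizes = [10, 50, 100]

# try every (learning rate, hidden size) pair and keep the lowest error
best = min(itertools.product(learning_rates, hidden_sizes),
           key=lambda combo: validation_error(*combo))
print(best)  # (0.01, 50)
```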
\par\null
Neural networks introduced two powerful breakthroughs to the machine
learning community. If built properly, a deep learning model can
represent any complex function; hence, neural networks are called
universal function approximators. Therefore, theoretically, deep
learning can be used for any arbitrary problem, and we do not need any
other machine learning model. Deep learning algorithms also eliminate a
very taxing hyperparameter choice: feature selection, or feature
engineering. In situations where features are highly correlated, some
algorithms, such as logistic and linear regression, do not perform well.
This problem is called multicollinearity. In traditional machine
learning algorithms, you either need to remove the features manually or
use a dimensionality reduction technique such as PCA before using the
features as input to the model. With deep learning, however,
dimensionality reduction happens automatically, as long as a layer's
width is less than the number of its inputs. Hence, deep learning
algorithms tend to retain the more important bits of information and
disregard the unimportant parts. Therefore, deep learning eliminates two
main hyperparameter choices: which model to choose, and feature
engineering and selection.
\par\null
There are some simple rules that practitioners follow to choose
hyperparameters beyond the default values. For example, if the problem
at hand is not complex in nature and we have limited data, then you need
a shallow model with as few as one hidden layer. Also, depending on the
properties of the problem, you might be able to choose an activation
function that helps the optimizer converge faster when training the
model. For example, in classification problems, the family of sigmoid
functions, including sigmoid and tanh, generally works well. The main
drawback of sigmoid functions is the vanishing gradient problem, which
can prevent the neural network from training further. In general, the
ReLU function is the default choice in deep learning models these days.
There are several variations of the ReLU function. For example, if you
encounter dead neurons in the network, the leaky ReLU function is a
better choice. As a general guideline, you can begin with the ReLU
function and then move to other activation functions in case ReLU
doesn't provide optimal results.
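For reference, the activation functions discussed above are all short formulas; the leaky-ReLU slope of 0.01 is just a common illustrative choice:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # a small slope for negative inputs keeps "dead" neurons trainable
    return z if z > 0 else alpha * z

print(sigmoid(0.0), tanh(0.0), relu(-3.0), leaky_relu(-3.0))
```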
\par\null\par\null
\selectlanguage{english}
\FloatBarrier
\end{document}