\documentclass{article}
\usepackage[affil-it]{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
\usepackage{url}
\usepackage{hyperref}
\hypersetup{colorlinks=false,pdfborder={0 0 0}}
\usepackage{etoolbox}
\usepackage{placeins}% provides \FloatBarrier, used before the bibliography
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[ngerman,english]{babel}
\begin{document}
\title{Supervised Learning: Classification and Regression}
\author{Naets}
\affil{Affiliation not available}
\author{Awaiting Activation}
\affil{Affiliation not available}
\author{JP Breuer}
\affil{Affiliation not available}
\date{\today}
\maketitle
\section{Regression}
\label{RegSection}
\subsection{Introduction}
The goal of Section \ref{RegSection} is to explain the formation of the prices of the cars in our dataset.
The following subsections focus on applying forward feature selection with ordinary least squares regression (FFSOLS) and an Artificial Neural Network (ANN) in order to fit and explain the formation of these prices as well as possible.
In addition to the FFSOLS regression presented in Subsection \ref{FFSOLSRegressionSubsection}, Appendices \ref{RegressionOLSSubsection} and \ref{RegressionLASSOSubsection} respectively present an OLS regression with manual feature selection based on statistics (F-tests and so forth, instead of cross-validation) and a brief introduction to Lasso regression.
\subsection{Forward Feature Selection with OLS Regression (FFSOLS regression)}
\label{FFSOLSRegressionSubsection}
FFSOLS regressions based on cross-validation are applied in the following two subsections. 10-fold cross-validation is used at the outer level, since 10 folds are commonly regarded as a reasonably efficient choice. Our initial dataset contains 159 observations, which leaves each outer training set with approximately 143 observations; this should be sufficient for training the models. Each outer test set is then composed of around 16 observations, which should yield mean test errors that reasonably approximate the generalization error of each of the 10 best models constructed in the inner part of the cross-validation.
Since the internal cross-validation of the forward selection is also set to 10 folds, the inner training sets of each outer training set contain around 129 observations each, used to fit the candidate models with each added attribute, while the inner test sets contain approximately 14 observations each, used to measure the improvement (i.e. the reduction of the test error) brought by each added attribute.
\subsubsection{First FFSOLS regression}
A first FFSOLS regression is applied on the original dataset (i.e. not mathematically transformed).
Figure \ref{FsRegressionVariables} shows the attributes retained in each of the 10 models that minimized the mean test error on the test data of the inner layer of the cross-validation procedure. In addition, the mean $R^{2}$ of these 10 models is presented: on average, approximately 87~\% of the variance of the price is explained by the attributes of the models.
One can also note that some attributes appear to be more stable (i.e. appear in more models) than others. This aspect is discussed further for the retained model.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.42\columnwidth]{figures/FsRegressionVariables/FsRegressionVariables}
\caption{{Attributes retained in each of the 10 models that minimized the mean test error of the test data in the inner layer of the cross-validation, using the untransformed data
\label{FsRegressionVariables}%
}}
\end{center}
\end{figure}
We selected the model that minimized the estimated generalized error, that is, the $4^{th}$ model.
The residuals as a function of the attributes used to construct the model are presented in Figure \ref{ResidFsRegression}. This gives one an idea of the possible heteroscedasticity (i.e. the non-constancy of the variance of the residuals as a function of an attribute) present in the model. The attributes responsible for heteroscedasticity of the residuals should be transformed\footnote{One should be aware that binary variables, even if not homoscedastic, cannot be improved using mathematical transformations such as logarithmic or exponential ones and are therefore not discussed.}.
For instance, one could suspect the variable 'Bore' to have such an impact, since the variance of the residuals seems to increase with 'Bore'. However, it is relatively difficult to judge whether that visual pattern is not merely driven by a few outliers. To get a better idea, one could compute Breusch-Pagan or White tests, which indicate which variables might be affected by this phenomenon. However, in order to keep this work brief, we instead compute models (see Sub-sub-section \ref{FFSOLSLogPriceSubsubsection}) using the FFSOLS regression with the natural logarithm of the dependent variable (i.e. the price), a technique which often reduces the overall amplitude of the heteroscedasticity.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/ResidFsRegression/ResidFsRegression}
\caption{{Residuals of the first FFSOLS model as a function of the attributes of the model
\label{ResidFsRegression}%
}}
\end{center}
\end{figure}
\subsubsection{FFSOLS regression using the natural logarithm of the price}
\label{FFSOLSLogPriceSubsubsection}
The 10 models resulting from the cross-validation are presented in Figure \ref{FsRegressionVariablesLog}. Now 89~\% of the variance of the price is explained by the attributes, which is slightly more than for the models based on the original data (0.87). This is a first argument in favor of the log model.
Once again, certain variables are more stable than others, especially 'Curbweight', 'Horsepower', 'Eur' (which was not significant in the models on the original data) and 'Brand123'. On the other hand, variables such as 'CompressionRatio', 'DriveWheelFwd', 'DriveWheelRwd' and 'Fuel' only appear in one of the 10 best models.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.42\columnwidth]{figures/FsRegressionLogVariables/FsRegressionLogVariables}
\caption{{Attributes retained in each of the 10 models that minimized the mean test error of the test data in the inner layer of the cross-validation using the natural logarithm of the price
\label{FsRegressionVariablesLog}%
}}
\end{center}
\end{figure}
The residuals as a function of the explanatory attributes of the best models (i.e. the $1^{st}$ and $10^{th}$ ones) are presented in Figure \ref{ResidFsRegressionLog}. It now seems much more difficult to find traces of non-constant variance for the remaining attributes.
We therefore keep this model as the final FFSOLS regression. The table containing the coefficients is presented in Table \ref{RegressVariablesCoeff_exportLog} in Sub-sub-section \ref{ModelInterpretationSubsubsection}.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/ResidFsLogRegression/ResidFsLogRegression}
\caption{{Residuals of the best FFSOLS model using the natural logarithm of the price as a function of the attributes of the model
\label{ResidFsRegressionLog}%
}}
\end{center}
\end{figure}
\subsubsection{Model interpretation}
\label{ModelInterpretationSubsubsection}
Table \ref{RegressVariablesCoeff_exportLog} shows the coefficients of the retained model. In addition, the normalized coefficients (estimated on the normalized attributes) are presented, so that the magnitude of the contribution of each attribute can be compared with the others. The table is sorted by the absolute values of the normalized coefficients.
The biggest contribution to the explanation of the price is given by 'CurbWeight', which is essentially the weight of the car when empty: the heavier the car, the more expensive. Next come the horsepower of the car and the brand, for which, again, the bigger, the more expensive the car.
After that, the 'Amer' variable shows that an American car is on average worth less than a car from elsewhere (one can also see that European cars are worth more than other cars on average).
The larger the wheel base of a car (see the 'WheelBase' variable), the more expensive the car. However, this variable has a small impact on the price compared to, for instance, the curb weight (a bit more than a fourth of it here).
One can also note other effects, such as the fact that 4-door cars are on average slightly more expensive than others; the impact on the price, however, is much smaller.
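The comparison via normalized coefficients can be reproduced as follows (a minimal sketch on synthetic stand-ins for the attributes; the names and true coefficients are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 159
# Stand-ins for e.g. CurbWeight, Horsepower, WheelBase
X = rng.normal(size=(n, 3))
beta = np.array([0.8, 0.5, 0.2])  # illustrative true effects on log(price)
log_price = X @ beta + rng.normal(0, 0.1, n)

# Coefficients fitted on standardized attributes are directly
# comparable in magnitude across attributes
Xz = StandardScaler().fit_transform(X)
coef_std = LinearRegression().fit(Xz, log_price).coef_

# Sort attributes by absolute normalized coefficient, as in the table
order = np.argsort(-np.abs(coef_std))
```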
\input{RegressVariablesCoeff_exportLog}
\subsection{Artificial Neural Network (ANN)}
Different Artificial Neural Networks of the 'feed-forward multilayer perceptron' type with one hidden layer are applied to our dataset in order to predict the price of second-hand cars.
Once again, 10-fold cross-validation is used to produce the models.
The number of networks trained per fold was set to 10. The initial weights of a network are randomly set and can greatly influence the quality of the final model; by training several networks for each fold, we increase the chances of finding a good model. Taking too large a number, however, would lead to enormous computation times on a laptop, which is to be avoided.
In order to find an efficient model, different numbers of hidden units with different stopping criteria (i.e. a training MSE to be reached and a maximum number of epochs) were used.
Using cross-validation, we computed the generalized errors of the ANN models with 1 to 20 hidden units by taking the mean of the test errors of the best model of each fold. Figure \ref{ResidAnnTest}, which shows these generalized errors for the 20 models, indicates that 15 hidden units seems optimal in our situation.
We therefore chose the model with 15 hidden units, which minimizes this error, to be compared with the best model obtained from the FFSOLS regression.
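A cross-validated sweep over the number of hidden units can be sketched with scikit-learn's MLPRegressor, an assumed stand-in for the network toolbox actually used; the dataset here is synthetic and only a few unit counts are swept to keep the example short:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=159, n_features=8, noise=10.0, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

gen_errors = {}
for h in (1, 5, 10, 15):  # number of hidden units in the single hidden layer
    net = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(h,), max_iter=200, random_state=0))
    scores = cross_val_score(net, X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    gen_errors[h] = -scores.mean()  # mean test error over the 10 folds

best_h = min(gen_errors, key=gen_errors.get)
```

The report's scheme additionally retrains each configuration 10 times with different initial weights; that would correspond to an extra loop over `random_state` values keeping the best run per fold.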
\paragraph{}
Secondarily, Figure \ref{TrainingErrorANN} (showing the training error of the models with the best number of hidden units, 15 in this case, as a function of the number of iterations for each of the 10 folds) shows that the first stopping criterion (reaching an MSE of 100 or smaller) was never achieved for the best models of the different folds, so training continued until all the iterations were complete (i.e. 200 epochs). In addition, this Figure shows that for most of the models, the training prediction errors stabilize after roughly 120 iterations.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/ResidAnnTraining1/ResidAnnTraining1}
\caption{{Training error of the best models of the 10 Folds as a function of the iterations
\label{TrainingErrorANN}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/ResidAnnTest1/ResidAnnTest1}
\caption{{Mean-square errors of the best models of each fold on their respective test datasets.
\label{ResidAnnTest}%
}}
\end{center}
\end{figure}
\subsection{Comparison of the results of the regression and neural network methods}
In order to compare the best FFSOLS regression and ANN models, one must first train these models and compute their estimated generalized errors based on the same k folds. If the folds differed between the two models, observations that are easier to predict might end up in the test folds of one model and not of the other; using the same folds for both models avoids this issue.
Once again, 10-fold cross-validation was chosen to estimate the generalized error. We can then compare the 10 test errors of the two models using a t-test in order to see whether the two vectors of test errors belong to the same population. This t-test is given in Figure \ref{ResidAnnTraining}, together with the corresponding box-plots of these vectors of test errors.
Even though the visual impression given by the box-plots might suggest that the FFSOLS regression is globally more efficient than the ANN model, this difference is not statistically significant, as shown by the p-value of the test, which is much bigger than 0.05 (i.e. the threshold for a 95~\% confidence level).
We can therefore not conclude that one model is more efficient than the other. However, one should be aware that one of the assumptions behind the t-test, namely the normality of the two samples compared, is far from being fulfilled; hence, the power of this test must be put into perspective.\selectlanguage{english}
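The fold-wise comparison can be sketched as follows; the error vectors below are hypothetical numbers, not the ones from the report, and scipy's ttest\_rel implements the paired t-test that is appropriate when both models share the same folds:

```python
import numpy as np
from scipy import stats

# Per-fold test errors from the same 10 outer folds (hypothetical values)
err_ols = np.array([2.1, 1.8, 2.5, 2.0, 1.9, 2.3, 2.2, 1.7, 2.4, 2.0])
err_ann = np.array([2.3, 1.6, 2.7, 1.9, 2.1, 2.2, 2.4, 1.6, 2.2, 2.1])

# Paired t-test: the folds are shared, so the errors form matched pairs
t_stat, p_value = stats.ttest_rel(err_ols, err_ann)

if p_value >= 0.05:
    verdict = "no significant difference at the 95% level"
else:
    verdict = "significant difference"
```

With these illustrative numbers the per-fold differences average out, mirroring the non-significant result reported above.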
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/RegAnnBoxPlotPerformance1/RegAnnBoxPlotPerformance1}
\caption{{Box-plots of the test errors of the FFSOLS regression and ANN models, with the p-value of their t-test (a p-value $\geq 0.05$ means the hypothesis of equal means cannot be rejected at the 95~\% confidence level)
\label{ResidAnnTraining}%
}}
\end{center}
\end{figure}
To verify that each of these two models performs better than simply predicting the output to be the average of the training data output, the vectors of test errors of the two models are also compared to the vector of errors obtained when predicting the mean of the training data output.
Even though the assumption of normality of the vectors is still not fulfilled, Figures \ref{RegMeanBoxPlotPerformance} and \ref{AnnMeanBoxPlotPerformance} clearly show that both models are much better than simply predicting the mean of the output variable.\selectlanguage{english}
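This baseline comparison can be sketched with scikit-learn's DummyRegressor, which predicts the training-set mean; the data is synthetic and a plain LinearRegression stands in for the FFSOLS model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=159, n_features=8, noise=10.0, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

err_model, err_mean = [], []
for tr, te in cv.split(X):
    reg = LinearRegression().fit(X[tr], y[tr])
    base = DummyRegressor(strategy="mean").fit(X[tr], y[tr])  # training mean
    err_model.append(mean_squared_error(y[te], reg.predict(X[te])))
    err_mean.append(mean_squared_error(y[te], base.predict(X[te])))

# A useful model should beat the mean-of-output baseline
improvement = np.mean(err_mean) - np.mean(err_model)
```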
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/AnnMeanBoxPlotPerformance1/AnnMeanBoxPlotPerformance1}
\caption{{Box-plots of the test errors of the ANN model and of the mean-of-price predictor, with the p-value of their t-test (a p-value $\geq 0.05$ means the hypothesis of equal means cannot be rejected at the 95~\% confidence level)
\label{RegMeanBoxPlotPerformance}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/RegMeanBoxPlotPerformance1/RegMeanBoxPlotPerformance1}
\caption{{Box-plots of the test errors of the FFSOLS regression model and of the mean-of-price predictor, with the p-value of their t-test (a p-value $\geq 0.05$ means the hypothesis of equal means cannot be rejected at the 95~\% confidence level)
\label{AnnMeanBoxPlotPerformance}%
}}
\end{center}
\end{figure}
\section{Classification}
\label{ClassificationSection}
\subsection{Introduction}
Throughout Section \ref{ClassificationSection}, different classification methods (i.e. Decision Trees, Multinomial Regression, K-Nearest Neighbors, Naïve Bayes and Artificial Neural Networks) are applied to the car dataset in order to classify cars based on their continent of origin (i.e. Europe, Asia or America).
\subsection{Application of classification methods}
\subsection{Prediction of a new observation in validated models}
\subsection{Statistical comparison of the two best performing models}
\section{Comparison with literature}
\label{ComparisonLiteratureSection}
In this Section, the best regression and classification models produced in this report are compared with models previously published in the literature.
Concerning the regression part, we could not find related articles presenting comparable results.
\section{Appendix}
\subsection{Regression OLS}
\label{RegressionOLSSubsection}
In this section the use of Ordinary Least Squares Regression is discussed.
Since the independence of the explanatory variables is one of the assumptions of OLS regression, one should select these independent variables carefully, at the risk of introducing severe multicollinearity into the model.
Even though the absence of correlation between variables does not necessarily mean that these variables are independent, a practical way of reducing the risk of multicollinearity is to base the selection of the independent variables on their correlations, and to exclude from the model explanatory variables whose pairwise correlation coefficients exceed 0.7 in absolute value\footnote{In the literature, the rule of thumb for this threshold varies from 0.7 to 0.85.}.
A selection of explanatory variables based on the correlation matrix presented in the previous report was therefore made.
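This correlation-based screening can be sketched as follows (hypothetical attribute names on synthetic data; the 0.7 threshold from the text is applied to the upper triangle of the absolute correlation matrix):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 159
curb = rng.normal(size=n)
df = pd.DataFrame({
    "CurbWeight": curb,
    "Length": curb * 0.95 + rng.normal(0, 0.2, n),  # strongly tied to CurbWeight
    "Horsepower": rng.normal(size=n),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.7).any()]
kept = df.drop(columns=to_drop)
```

Here 'Length' is dropped because its correlation with 'CurbWeight' exceeds the threshold.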
\paragraph{}
Furthermore, one should be careful not to keep variables which are not statistically significant at a certain threshold (e.g. keep only coefficients with p-values smaller than or equal to 0.05). In addition to preventing one from using variables that are poor predictors, this produces sparser models, which is usually sought in modeling.
One possibility for keeping only significant variables is stepwise regression. This method uses an algorithm which can be summarized as follows \cite{IMM2012-06787}~;
\begin{itemize}
\item adding the explanatory variable most correlated with the price~;
\item producing the model~;
\item testing all the variables in the model using an F-test to verify that they are statistically significant~;
\item removing all the variables that do not meet the threshold defined in the previous step~;
\item adding the explanatory variable most correlated with the price that has not yet been included in the model, and reproducing the previous steps~;
\item once all the variables have been included at least once in the model and all the non-significant variables have been removed, the algorithm stops.
\end{itemize}
However, since the dimensionality of the problem is not too high, a manual approach is performed in the following Sub-sub-section (\ref{FitModelSubsubSec}).
\subsubsection{Fit of the model}
\label{FitModelSubsubSec}
Table \ref{1stOLSCoeff.tex} shows the resulting coefficients with their p-values, confidence intervals and so forth for the first OLS regression using all the explanatory variables. One can easily note that many of these variables are not statistically significant.
\input{1stOLSCoeff}
\paragraph{}
Table \ref{LastOLSCoeff.tex} shows the variables remaining after keeping only the explanatory variables that are not too highly correlated among themselves and that remain significant for explaining the formation of the prices.
Figure \ref{Heteroscedasticity} shows that this last model does not exhibit strong heteroscedasticity (i.e. no specific pattern in the residuals). Therefore, no mathematical transformation is applied to the variables.
\input{LastOLSCoeff}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/Heteroscedasticity1/Heteroscedasticity1}
\caption{{Residuals VS the remaining quantitative variables in the model
\label{Heteroscedasticity}%
}}
\end{center}
\end{figure}
One can see that the residuals of the model presented in Figure \ref{DistriResiOLSFinal} have mean zero and are symmetrical. However, one cannot assume the distribution of the residuals to be normal, as shown by the p-value of the Shapiro-Wilk test. On the other hand, the graph on the right of Figure \ref{DistriResiOLSFinal} suggests the residuals are likely to be Laplace distributed (see the p-value of the Anderson-Darling test). One should therefore be careful when using this model for predictions, since the assumption of normality of the residuals is not fulfilled.\selectlanguage{english}
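These goodness-of-fit checks can be sketched with scipy on synthetic Laplace-distributed residuals; note that scipy's anderson routine does not cover the Laplace law, so a Kolmogorov-Smirnov test with fitted parameters is used here as a stand-in for the Anderson-Darling test of the report:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Heavy-tailed, symmetric, zero-mean residuals (Laplace-like)
resid = rng.laplace(loc=0.0, scale=1.0, size=159)

# Shapiro-Wilk: null hypothesis = the residuals are normally distributed;
# a small p-value rejects normality
_, p_normal = stats.shapiro(resid)

# Goodness of fit to a Laplace law (KS test as a stand-in for
# Anderson-Darling, which scipy only implements for a few distributions)
loc, scale = stats.laplace.fit(resid)
_, p_laplace = stats.kstest(resid, "laplace", args=(loc, scale))
```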
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/DistriResiOLSFinal1/DistriResiOLSFinal1}
\caption{{Distribution of the residuals of the final OLS model.
\label{DistriResiOLSFinal}%
}}
\end{center}
\end{figure}
Nonetheless, Table \ref{LastOLSSummary} shows the summary of the final model, with an $R^{2} = 0.9$, which means 90~\% of the variance of the price is explained by the remaining explanatory variables.
\input{LastOLSSummary}
\subsubsection{Prediction of new data observation}
Among the quantitative variables, only the curb weight and the number of cylinders remain. This is mainly due to the high correlation among most of the quantitative variables, which forced us to keep only these two. We can see that the price of cars increases with the number of cylinders (more than 1~500~\$\footnote{The unit of the prices of this dataset is not provided, but we can probably assume them to be expressed in \$.}) and with the weight of the car.
Concerning the dummy (i.e. binary) variables, one can see that, all else being equal, American cars (and Asian ones to a lesser extent) are worth less. On the other hand, cars equipped with a turbo (i.e. the 'Aspiration' variable) and with power transmitted to the rear wheels are worth on average nearly 2~000~\$ and slightly more than 1~500~\$ more, respectively. Finally, gas cars tend to be worth less than diesel ones (-1~000~\$) and good branding (i.e. the 'Brand01TG' variable) explains nearly 4~000~\$ of the price of a car on average.
\subsection{Regression LASSO}
\label{RegressionLASSOSubsection}
Lasso regression belongs to the family of regularized regression methods, which minimize not only the norm of the residuals but also the sum of the absolute values of the coefficients (weighted by a coefficient $\alpha$). By doing so, the regression is less affected by outliers\footnote{The OLS method is highly sensitive to outliers since the 2-norm is used.}. Moreover, by penalizing the absolute values of the coefficients, the method sparsifies the model by setting to zero the coefficients of variables that are not significantly related to the dependent variable or that are too highly correlated among themselves. This robust method is therefore particularly desirable when working with high-dimensional datasets.
One drawback of the method is that when two explanatory variables are too highly correlated, the variable whose coefficient is set to zero is not necessarily the same each time the method is run on the same dataset. However, this should not prevent us from applying the method.
\subsubsection{Fit of the model}
The method is applied using the class 'Lasso' from the package 'sklearn'. Table \ref{LassoSparsity} shows that the bigger the value of $\alpha$, the sparser the model (with $\alpha = 0$, the regression would reduce to an OLS one). This reduction of the number of variables as $\alpha$ increases can be shown more graphically using the LARS implementation of 'sklearn' with the 'lasso' method.
The resulting Lasso path is shown in Figure \ref{LassoLarsePath}. One can see that the amplitudes of the coefficients tend to shrink and finally reach zero as $\alpha$ increases (here, when going to the left).
\input{LassoSparsity}
In order to select the model with the best $\alpha$, one can use K-fold cross-validation as a criterion \cite{friedman2010regularization}. Such a criterion is available in the 'sklearn' package (e.g. as 'cross\_validation.KFold').
However, since this method was not required for the report, the cross-validation of this regression type is not presented here and is left to the reader as an exercise.\selectlanguage{english}
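As a starting point for that exercise, a minimal sketch (on synthetic data) of the Lasso path via sklearn's lars\_path and of the cross-validated choice of $\alpha$ via LassoCV could look like:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, lars_path

X, y = make_regression(n_samples=159, n_features=8, noise=10.0, random_state=0)

# Lasso path computed with the LARS algorithm: as alpha increases,
# coefficients shrink and are eventually set exactly to zero
alphas, _, coefs = lars_path(X, y, method="lasso")

# 10-fold cross-validation to pick the best alpha
model = LassoCV(cv=10, random_state=0).fit(X, y)
best_alpha = model.alpha_
```

Plotting `coefs` against `alphas` reproduces the Lasso path of the figure; `best_alpha` is the value the cross-validation criterion would retain.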
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/LassoLarsePath1/LassoLarsePath1}
\caption{{Lasso Path with Lars algorithm
\label{LassoLarsePath}%
}}
\end{center}
\end{figure}
\selectlanguage{english}
\FloatBarrier
\bibliographystyle{plain}
\bibliography{bibliography/converted_to_latex.bib%
}
\end{document}