Supervised Learning: Classification and Regression

\label{RegSection}

The goal of Section \ref{RegSection} is to explain how the prices of the cars in our dataset are formed.

The following subsections apply forward feature selection with ordinary least squares regression (FFSOLS) and an artificial neural network (ANN) to fit these prices as closely as possible and to explain how they are formed.

In addition to the FFSOLS regression presented in Subsection \ref{FFSOLSRegressionSubsection}, Appendices \ref{RegressionOLSSubsection} and \ref{RegressionLASSOSubsection} contain, respectively, an OLS regression with manual feature selection based on statistical tests (F-test and so forth, instead of cross-validation) and a briefly introduced Lasso regression.

\label{FFSOLSRegressionSubsection}

FFSOLS regressions based on cross-validation are applied in the following two subsections. 10-fold cross-validation is used at the outer level, since 10 is often regarded as a reasonably efficient number of folds. Our initial dataset contains 159 observations, so each outer training set holds approximately 144 observations, which seems sufficient to train the models. Each outer test set, in turn, holds around 16 observations, so the mean test errors should give a reasonable approximation of the generalization error of each of the 10 best models constructed in the inner part of the cross-validation.
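The fold sizes quoted above can be checked with a few lines of Python. This is only a sketch: the helper \texttt{fold\_sizes} is a hypothetical name, and it assumes that the 159 observations are split as evenly as possible across the 10 outer folds.

```python
def fold_sizes(n_samples, n_folds):
    """Test-fold sizes when n_samples are split as evenly as possible."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds receive one additional observation.
    return [base + 1 if i < extra else base for i in range(n_folds)]

N_OBS = 159    # number of observations, as stated in the text
K_OUTER = 10   # number of outer folds, as stated in the text

outer_test = fold_sizes(N_OBS, K_OUTER)
outer_train = [N_OBS - size for size in outer_test]

print("outer test sizes: ", outer_test)
print("outer train sizes:", outer_train)
```

Under this assumption, nine test folds contain 16 observations and one contains 15, so the outer training sets contain 143 or 144 observations.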

Since the inner cross-validation of the forward selection also uses 10 folds, each outer training set is further split into inner training sets of around 130 observations, used to fit the candidate models obtained by adding one attribute at a time, and inner test sets of approximately 14 observations, used to measure the improvement brought by each added attribute (i.e. the reduction in the error of the model).
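The two-level procedure described above can be sketched as follows. This is a minimal illustration and not the implementation used for the results of this report: the data are synthetic, all function names are hypothetical, and the stopping rule (stop as soon as no added attribute lowers the mean inner test error) is an assumption.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split it into k nearly equal test folds."""
    return np.array_split(rng.permutation(n), k)

def ols_mse(X_tr, y_tr, X_te, y_te):
    """Fit OLS with an intercept on the training part, return the test MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_te = np.column_stack([np.ones(len(X_te)), X_te])
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((y_te - A_te @ beta) ** 2))

def cv_error(X, y, features, folds):
    """Mean test MSE over `folds` of an OLS model restricted to `features`."""
    cols = np.asarray(features, dtype=int)
    all_idx = np.arange(len(y))
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(all_idx, test_idx)
        errors.append(ols_mse(X[train_idx][:, cols], y[train_idx],
                              X[test_idx][:, cols], y[test_idx]))
    return float(np.mean(errors))

def forward_select(X, y, k, rng):
    """Greedy forward selection driven by the inner k-fold error."""
    folds = kfold_indices(len(y), k, rng)   # same inner folds for every candidate
    selected = []
    best = cv_error(X, y, selected, folds)  # intercept-only baseline
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: cv_error(X, y, selected + [j], folds) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:          # assumed stopping rule
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected

# Synthetic stand-in for the dataset: 159 rows, 6 attributes,
# of which only attributes 0 and 2 actually drive the target.
rng = np.random.default_rng(0)
n, p = 159, 6
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

chosen, outer_errors = [], []
for test_idx in kfold_indices(n, 10, rng):          # outer level
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    feats = forward_select(X[train_idx], y[train_idx], 10, rng)  # inner level
    chosen.append(sorted(feats))
    cols = np.asarray(feats, dtype=int)
    outer_errors.append(ols_mse(X[train_idx][:, cols], y[train_idx],
                                X[test_idx][:, cols], y[test_idx]))

print("attributes kept per outer fold:", chosen)
print("mean outer test MSE:", np.mean(outer_errors))
```

Each outer fold yields one selected model, so the loop produces 10 attribute sets and 10 outer test errors. Comparing the attribute sets across folds shows which attributes are selected consistently.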

A first FFSOLS regression is applied to the original dataset (i.e. without any mathematical transformation).

Figure \ref{FsRegressionVariables} shows the attributes retained in each of the 10 models that minimized the mean test error on the inner layer of the cross-validation procedure. The mean \(R^{2}\) of these 10 models is also reported: on average, approximately 87\% of the variance of the variable price is explained by the attributes of the models.
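For reference, the coefficient of determination of a single model and its mean over the 10 models can be written as below. The notation is an assumption introduced here: \(y_i\) is the observed price, \(\hat{y}_i\) the fitted price, \(\bar{y}\) the mean observed price, and \(R^{2}_{k}\) the value obtained for the model of outer fold \(k\). The text does not state whether \(R^{2}\) is computed on training or on test data, so the sums run over whichever set was used.

\[
R^{2} = 1 - \frac{\sum_{i} \left( y_i - \hat{y}_i \right)^{2}}{\sum_{i} \left( y_i - \bar{y} \right)^{2}},
\qquad
\overline{R^{2}} = \frac{1}{10} \sum_{k=1}^{10} R^{2}_{k}
\]

A mean value of about 0.87 corresponds to the 87\% of explained variance reported above.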

Some attributes also appear to be more stable than others (i.e. they appear in more of the 10 models). This aspect is discussed when the retained model is presented.