Outlier detection/Anomaly detection

Outlier detection is a crucial step in data science (data analysis, processing and modeling) and, more generally, whenever one wants to extract information from data. This is all the more true when the machine learning algorithms used to extract that information are not robust to outliers.

This section therefore addresses the task of flagging potential outliers. To do so, four different methods are applied to our dataset and presented in Subsections \ref{GaussianKernelDensitySubsubsection}, \ref{KNNDensitySubsubsection}, \ref{KNNAverageRelativeDensitySubsubsection} and \ref{KthNearestNeighbotSubsubsection}.

Then, the results of these four methods are combined to highlight the observations (i.e. the cars) that are most likely to be outliers.

Finally, these candidate outliers are analyzed to decide whether they should be excluded from the dataset before performing regressions or classifications.

Scoring methods on observations

Some of the methods presented in the following Subsections perform better on scaled data, and especially on data expressed in an orthogonal basis. Hence, we project our normalized observations onto their principal components before applying the outlier detection methods.
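The preprocessing described above can be sketched as follows. This is a minimal illustration rather than the implementation used for the results in this section: it assumes NumPy, that normalization means standardizing each attribute to zero mean and unit variance, and that no attribute is constant. The function name `project_on_principal_components` and the toy data are hypothetical.

```python
import numpy as np

def project_on_principal_components(X):
    """Standardize the columns of X, then express each row in the PCA basis.

    Assumption: "normalized" means zero mean and unit variance per attribute.
    A constant column would give a zero standard deviation and must be
    removed beforehand.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each attribute (column).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data: the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    # Coordinates of each observation on the principal components.
    return Z @ Vt.T

# Hypothetical toy data: 6 observations, 3 attributes on very different scales.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 3)) * np.array([1.0, 10.0, 100.0])
scores = project_on_principal_components(X_toy)
```

After the projection, the columns of `scores` are mutually uncorrelated, so distance-based and density-based scores are not dominated by a single large-scale or redundant attribute.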