Authorea

Thomas Lin Pedersen edited R&D2.tex over 9 years ago

Commit id: 71f14ad40cdfc496b3e02556e6decbbd6aab3cc4


\subsubsection{Angle based}
It is well known that distance measures deteriorate as dimensionality increases, yet most multivariate outlier detection methods are still based on distances. \citet{Kriegel:2008:AOD:1401890.1401946} proposed a technique called angle-based outlier detection (ABOD) that, instead of distances, measures the variance of the angles between a point and all pairs of other points in the dataset. Intuitively, an outlying point has a low variance of angles because all other points lie in roughly the same direction. Conversely, a central point has a high variance of angles because it is surrounded by points. The naive algorithm has a complexity of $O(pn^3)$ for a dataset with $p$ dimensions and $n$ samples, which makes it very time consuming. Faster approximations have been developed \citep{Pham:2012tq}, but these are outside the scope of this report. A nice feature of ABOD is that it is parameter free and thus well suited for unsupervised monitoring.

Applying the ABOD algorithm as implemented in the ELKI framework leads to roughly the same conclusions as the SOS approach (\ref{fig:abodRes}), though only run 14 stands out clearly, as opposed to runs 14, 10, 3, 15 and 12 in SOS. While the rationale of ABOD is perhaps easier to understand than that of SOS, the method is also less adaptive to different data distributions. It is easy to envision data clouds that would mask outliers as defined by ABOD, because it does not account for the neighborhood (e.g. multiple clusters of points surrounding an outlier). Another downside of ABOD, one that undermines the value of being parameter free, is that the output is unbounded and heavily dependent on the scale of the input data. This means that a sensible cutoff value cannot be predefined but needs to be set on a case-by-case basis. Selecting the top $k$ points as outliers, as suggested by the authors, runs the risk of either over- or underestimating the number of outliers and consequently deteriorating the quality of the analysis.
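
To make the cost and the scale dependence concrete, the angle-based outlier factor (ABOF) can be sketched as follows. This is a reconstruction of the definition given by \citet{Kriegel:2008:AOD:1401890.1401946}, not a restatement of the ELKI implementation, and the symbols $\mathcal{D}$, $A$, $B$, $C$ and $s$ are notation introduced here for illustration only. For a point $A$ in the dataset $\mathcal{D}$, with $\overrightarrow{AB} = B - A$,
\begin{equation}
\mathrm{ABOF}(A) = \mathrm{Var}_{B,C \in \mathcal{D} \setminus \{A\}} \left( \frac{\langle \overrightarrow{AB}, \overrightarrow{AC} \rangle}{\|\overrightarrow{AB}\|^{2} \, \|\overrightarrow{AC}\|^{2}} \right),
\end{equation}
where the variance is taken over all pairs of other points and each pair is weighted by $1 / (\|\overrightarrow{AB}\| \, \|\overrightarrow{AC}\|)$, so that nearby points contribute more. A low value indicates an outlier. Under this definition each of the $n$ points requires $O(n^2)$ pairs, and each inner product costs $O(p)$, which is consistent with the $O(pn^3)$ complexity of the naive algorithm. It also suggests why the score depends on the scale of the data: if every coordinate is multiplied by a factor $s > 0$, each term inside the variance is multiplied by $s^{-2}$ while the normalised weights are unchanged, so that
\begin{equation}
\mathrm{ABOF}_{s\mathcal{D}}(sA) = s^{-4} \, \mathrm{ABOF}_{\mathcal{D}}(A).
\end{equation}
Assuming this form, the ranking of points is unaffected by a uniform rescaling, but any fixed cutoff on the raw score is not.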