A novel distance-based approach is stochastic outlier selection (SOS) \citep{Janssens:2012wr}, which borrows much of its theory from the t-SNE algorithm for dimensionality reduction \citep{vanderMaaten:2008tm}. It uses graph theory to calculate the probability that a point connects to each of its neighbors in a stochastic neighbor graph (SNG), and from this derives the probability that the point receives no connection at all (a zero in-degree) in an SNG. The output is thus a soft classification of outlierness, but it can be turned into a hard classifier by setting an appropriate threshold. The algorithm takes only a single parameter, the perplexity, which is a measure of the size of the neighborhood taken into account when assessing a sample. It can roughly be interpreted as the number of points in the neighborhood, but can take any positive real value. Evaluating the perplexity parameter on our test data shows that the conclusions are rather stable, at least as long as the dataset does not contain obvious clusters. If it does, the parameter may need to be tuned so that small clusters are not flagged as outliers. Another possibility is to use several perplexity values and thereby obtain a range of measures covering local to global outlyingness.

Using a perplexity of 5 gives the results visualised in figure \ref{fig:sosRes}. Five samples stand out, with sample 14 being the most obvious, both according to a crude PCA model and to the calculated probability scores. The SOS result further indicates that samples 10 and 15, and to some degree 12, also warrant investigation. The sample with the second highest outlier probability (sample 10) only appears outlying in the fifth principal component (not shown), further underlining that looking only at the first few components will not alert the user to all suspicious runs.
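To make the mechanics of the algorithm concrete, the following is a minimal sketch of the SOS computation as described by \citep{Janssens:2012wr}: affinities with a per-sample bandwidth tuned by binary search so that the entropy of each affinity distribution matches the requested perplexity, normalisation into binding probabilities, and finally the probability of a zero in-degree. The use of Python/NumPy, the function name, and the choice of squared Euclidean dissimilarity are illustrative assumptions rather than the reference implementation; SOS itself accepts any dissimilarity measure.

\begin{verbatim}
import numpy as np

def sos_outlier_probability(X, perplexity=5.0, tol=1e-5, max_iter=100):
    """Sketch of stochastic outlier selection on the rows of X."""
    n = X.shape[0]
    # Squared Euclidean dissimilarities (assumed here for illustration).
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

    B = np.zeros((n, n))        # binding probabilities of the SNG
    log_h = np.log(perplexity)  # target entropy per affinity distribution
    for i in range(n):
        d = np.delete(D2[i], i)  # dissimilarities to the other n - 1 points
        beta, lo, hi = 1.0, 0.0, np.inf
        for _ in range(max_iter):
            a = np.exp(-beta * d)               # affinities at bandwidth beta
            s = a.sum()
            H = np.log(s) + beta * (d @ a) / s  # Shannon entropy
            if abs(H - log_h) < tol:
                break
            if H > log_h:  # neighborhood too wide: increase beta
                lo = beta
                beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
            else:          # neighborhood too narrow: decrease beta
                hi = beta
                beta = (lo + hi) / 2.0
        B[i] = np.insert(a / s, i, 0.0)  # row of binding probabilities

    # P(sample i is an outlier) = probability that no other point binds
    # to it, i.e. that it has a zero in-degree in a sampled SNG.
    return np.prod(1.0 - B, axis=0)
\end{verbatim}

Under this sketch, the multi-perplexity strategy suggested above amounts to calling the function for several perplexity values, for example 5, 10, and 30, and inspecting how each sample's outlier probability changes as the neighborhood grows from local to global.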