Thomas Lin Pedersen edited R&D1.tex  over 9 years ago


\subsection{Missing-label outlier detection}

In the case of within-run outlier detection it is often not possible to create a model beforehand. The problem then changes into one of missing-label outlier detection, with the added complexity of being applied to a continuous data stream. To cope with the latter, three strategies have been identified: compare only the most recent sample with older samples (i.e.\ do not recalculate old samples as new ones are added), recalculate everything each time a new sample is added, or use a moving window and recalculate only for the last $k$ samples. In LC-MS/MS analyses the data acquisition is almost always the time-limiting step, so recalculating values is feasible even for time-consuming algorithms.

\subsubsection{Distance based}

One of the most frequently used distance-based approaches in multivariate outlier detection is the robust Mahalanobis distance \citep{Chen:2008vm}. A QQ plot will reveal probable outliers as points deviating from the linear relationship in the upper right corner. While easy to understand, this approach has some downsides. The Mahalanobis distance assumes an elliptical shape of the data cloud, an assumption that is easily violated in real data. Furthermore, it can only be applied in situations where $n > p$. The latter makes it impractical in our case, as it would require more than 45 samples before outlyingness could be assessed.

A more recent distance-based approach, called stochastic outlier selection (SOS) \citep{Janssens:2012wr}, borrows much of its theory from the t-SNE algorithm for dimensionality reduction \citep{vanderMaaten:2008tm}. It uses graph theory to calculate the probability that a point will connect to its neighbors in a stochastic neighbor graph (SNG). This is converted into the probability that a point will get no connection (a zero in-degree) in an SNG.
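The Mahalanobis screening described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it uses the classical mean and covariance for brevity, whereas a properly robust version would substitute an estimator such as the minimum covariance determinant; the function name and the chi-square cutoff at the 0.975 quantile are our own choices.

```python
import numpy as np
from scipy import stats

def mahalanobis_outliers(X, alpha=0.975):
    """Flag outliers by squared Mahalanobis distance.

    Sketch only: uses classical mean/covariance estimates. A robust
    variant would plug in e.g. an MCD estimate of center and scatter.
    Requires n > p so the covariance matrix is invertible.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if n <= p:
        raise ValueError("Mahalanobis distance requires n > p")
    center = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - center
    # d2[i] = diff[i] @ cov_inv @ diff[i] for every sample i
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    # compare against a chi-square quantile with p degrees of freedom
    cutoff = stats.chi2.ppf(alpha, df=p)
    return d2, d2 > cutoff
```

Points whose squared distance exceeds the chi-square quantile are flagged; plotting the ordered distances against chi-square quantiles yields the QQ plot mentioned above.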
The output is thus a soft classification of outlierness, but by setting an appropriate threshold it can be converted into a hard classifier. The algorithm takes only a single parameter, called perplexity, which is a measure of the size of the neighborhood taken into account when assessing a sample. It can roughly be interpreted as the number of points in the neighborhood, but can take any positive real value. Evaluating the perplexity parameter on our test data shows that the conclusions are rather stable, at least as long as the dataset does not show obvious clusters. If it does, the parameter may need to be tuned so that small clusters are not flagged as outliers. Another possibility is to use several perplexity values and thereby obtain a range of measures covering local to global outlyingness. Using a perplexity of 5 gives the results visualised in figure \ref{fig:sosRes}. Five samples stand out, with run 14 being the most obvious, both according to a crude PCA model and
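The SOS computation outlined above can be sketched compactly: each sample's neighborhood bandwidth is tuned by bisection so that the entropy of its binding distribution matches the requested perplexity, and a sample's outlier probability is the probability that no other sample binds to it. This is a simplified sketch under our own naming, not the authors' reference implementation.

```python
import numpy as np

def sos(X, perplexity=5.0, eps=1e-10):
    """Stochastic outlier selection: probability of a zero in-degree
    in a stochastic neighbor graph. Minimal sketch, not optimized."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # squared Euclidean dissimilarities between all pairs
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    B = np.zeros((n, n))  # binding probabilities b_ij
    target = np.log(perplexity)
    for i in range(n):
        d = np.delete(D[i], i)
        lo, hi = 1e-12, 1e12
        beta = 1.0  # precision of sample i's affinity kernel
        for _ in range(100):
            a = np.exp(-d * beta)
            p = a / (a.sum() + eps)
            h = -(p * np.log(p + eps)).sum()  # Shannon entropy
            if abs(h - target) < 1e-6:
                break
            if h > target:          # neighborhood too broad
                lo = beta
                beta = beta * 2 if hi >= 1e12 else (lo + hi) / 2
            else:                   # neighborhood too narrow
                hi = beta
                beta = (lo + hi) / 2
        b = np.exp(-d * beta)
        B[i, np.arange(n) != i] = b / (b.sum() + eps)
    # P(x_j is outlier) = product over i != j of (1 - b_ij):
    # the chance that no other point links to x_j in an SNG
    return np.prod(1.0 - B, axis=0)
```

Thresholding the returned probabilities (the paper suggests values around 0.5) turns the soft classification into the hard classifier discussed above.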