Thomas Lin Pedersen edited R&D1.tex  over 9 years ago

Commit id: dfcbb79589a324010839b41cb325f1e472c76f35


\subsection{Missing-label outlier detection}

In within-run outlier detection it is often not possible to create a model beforehand. The problem then becomes one of missing-label outlier detection, with the added complexity of being applied to a continuous data stream. To cope with the latter, three strategies have been identified: (1) compare only the most recent sample with older samples (i.e. don't recalculate old samples as new ones are added), (2) recalculate everything once a new sample is added, or (3) use a moving window to recalculate only the last $k$ samples. In LC-MS/MS analyses the data acquisition is almost always the time-limiting step, so recalculating values is feasible even for time-consuming algorithms. This report focuses on the second approach, as the run size will normally be so small that recalculation never becomes computationally prohibitive. In addition, a moving window across 15 points doesn't seem sensible, as the window needs to have a certain size for outlier detection to be meaningful. The first approach, while ensuring a stable output (once a point has been computed it doesn't change), also results in the first member of a new cluster being permanently labelled as an outlier.

\subsubsection{Distance based}

One of the most commonly used distance-based approaches in multivariate outlier detection is the robust Mahalanobis distance \citep{Chen:2008vm}. A QQ plot reveals probable outliers as points that deviate from the linear relationship in the upper right corner. While easy to understand, the approach has some downsides. The Mahalanobis distance assumes an elliptical shape of the data cloud, an assumption that can easily be violated in real data. Furthermore, it can only be applied in situations where $n > p$, i.e. where there are more samples than variables. The latter makes it impractical in our case, as it would require more than 45 samples before outlyingness can be assessed.
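
For reference, the distance discussed above can be written out explicitly. This is a sketch under stated assumptions: the symbols $\mathbf{x}_i$, $\hat{\mu}$, $\hat{\Sigma}$ and $q_i$ are notation introduced here for illustration, and the choice of robust estimator (the minimum covariance determinant would be one option) is not specified in the text. For a $p$-dimensional sample $\mathbf{x}_i$, the robust Mahalanobis distance is
\begin{equation}
d_i = \sqrt{(\mathbf{x}_i - \hat{\mu})^{\top} \hat{\Sigma}^{-1} (\mathbf{x}_i - \hat{\mu})}, \qquad i = 1, \dots, n,
\end{equation}
where $\hat{\mu}$ and $\hat{\Sigma}$ are robust estimates of location and scatter. If the bulk of the data is approximately multivariate normal, $d_i^2$ approximately follows a $\chi^2_p$ distribution, so a QQ plot would typically compare the ordered values $d_{(1)}^2 \le \dots \le d_{(n)}^2$ with the theoretical quantiles
\begin{equation}
q_i = F^{-1}_{\chi^2_p}\left(\frac{i - 0.5}{n}\right),
\end{equation}
where $F_{\chi^2_p}$ is the $\chi^2_p$ distribution function and $(i - 0.5)/n$ is one common choice of plotting position. The restriction on sample size follows from the need for $\hat{\Sigma}^{-1}$ to exist: the classical sample covariance matrix of $n$ samples has rank at most $n - 1$ and is therefore singular whenever $n \le p$.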