Thomas Lin Pedersen edited Results and Discussion.tex over 9 years ago

Commit id: 614317bf17f682ec79c19f007d18d2177215d5e6

\section{Results and Discussion}
Outlier detection approaches can be classified in many ways. One prevalent distinction is between one-class learning problems and missing-label learning problems. In the former, a set of good samples is used to define a model of conforming samples, and outliers are the samples that deviate from this model. In the latter, all samples are considered to be missing a label (outlier or non-outlier), and prior assumptions about the data are used to build a model that can assign labels to the samples. This division is sound in our case. One-class learning is well suited to run-to-run QC, which usually involves analysing the same sample as part of every run and comparing it to a set of already accepted analyses of that sample. For within-run QC, a prior set of acceptable samples is not available, as the samples being analysed are often new. In that case outliers must be caught by continually evaluating the current sample set with respect to itself rather than to an external reference.

As example data, the Velos data from Vanderbilt University Medical Center were used, as this dataset contained the most samples over a long period of time. All samples were divided into runs based on the time difference between consecutive samples: a difference exceeding 2 hours marked the start of a new run. Using this approach, 37 runs were identified in the dataset, with a median run length of 15 samples. Two of the runs contained only one sample and were removed. In each run the first, middle and last samples were designated as standards used to monitor between-run variation. A stable instrument period between Feb. 25 and April 15, 2013 was identified, and the samples from that period were used as a training set. The training set thus included 89 samples.
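The run segmentation described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function names \texttt{split\_into\_runs} and \texttt{standard\_positions}, the example timestamps, and the choice of integer division to pick the middle sample in runs of even length are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def split_into_runs(timestamps, max_gap=timedelta(hours=2)):
    """Group acquisition times into runs.

    A gap larger than max_gap between consecutive samples starts a new run.
    """
    runs = []
    for t in sorted(timestamps):
        if runs and t - runs[-1][-1] <= max_gap:
            runs[-1].append(t)   # gap within threshold: same run
        else:
            runs.append([t])     # first sample, or gap exceeded: new run
    return runs


def standard_positions(run_length):
    """Indices of the first, middle and last sample of a run.

    Assumption: for even run lengths the middle is run_length // 2.
    Duplicate indices (very short runs) are collapsed.
    """
    return sorted({0, run_length // 2, run_length - 1})


# Hypothetical acquisition times (minutes after an arbitrary start).
start = datetime(2013, 3, 1, 8, 0)
times = [start + timedelta(minutes=m) for m in (0, 60, 120, 400, 460, 900)]

all_runs = split_into_runs(times)
# Runs with a single sample are dropped, as in the text.
runs = [r for r in all_runs if len(r) > 1]
standards = [standard_positions(len(r)) for r in runs]
```

With these hypothetical timestamps the six samples fall into three runs, one of which contains a single sample and is discarded. For a run of the median length of 15 samples, \texttt{standard\_positions(15)} selects indices 0, 7 and 14.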
\subsection{One-class outlier detection}

\subsubsection{PCA}
One of the most used tools for multivariate process control is PCA, so this is a natural starting point for our investigation.

\strong{PCA variants}

\subsubsection{Random Forest}

\subsection{Missing-label outlier detection}

\subsubsection{Distance based}
include angle variance bagging of Mahalanobis distances

\subsubsection{Revisit PCA}