Thomas Lin Pedersen edited Results and Discussion.tex  over 9 years ago

\subsection{One-class outlier detection}

\subsubsection{PCA}

One of the most widely used tools for multivariate process control is PCA, making it a natural starting point for our investigation. While PCA is notoriously sensitive to outliers and robust versions are therefore often preferred, the training set is void of outliers, so the choice of algorithm should not have a large effect. A quick comparison of NIPALS \citep{citeulike:8609111}, Bayesian PCA (bpca) \citep{Oba_2003}, probabilistic PCA (ppca) \citep{Roweis98emalgorithms} and robust PCA (rpca), as implemented in pcaMethods \citep{pcaMethods}, shows just that, and NIPALS will be used onward. The model reaches an optimal $Q^2$ value at 5 components, where 61\% of the variance is explained. Using this model it is possible to create 6 control charts for the test data (one for each component and one for the distance to the model). The strength of this approach is that it allows the user to monitor drift in several different dimensions. The problem is that, while some obvious outliers are visible, more subtle outliers related to combinations of several PCs do not stand out well. It is possible to get a better view using a scatterplot matrix, but that will still only visualise outliers defined by two PCs, and it comes at the expense of the time dimension, which is paramount to identifying drifts.

\subsubsection{One-class Support Vector Machine}

One of the most classical one-class outlier detection methods is the one-class SVM (osvm), in which an SVM is trained to enclose a set of samples in the most efficient way. Outliers are then defined as samples lying outside the bounds of the support vectors. Osvm is a hard classification technique, and the output will only be outlier/non-outlier for every sample.
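This hard labelling can be illustrated with a minimal sketch using scikit-learn's \texttt{OneClassSVM} on invented data; the \texttt{nu} and kernel settings below are illustrative assumptions, not the parameters used in this study:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Invented stand-in for the (outlier-free) training set: a tight cluster.
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 5))

# nu bounds the fraction of training samples allowed outside the boundary.
ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
ocsvm.fit(X_train)

# The output is a hard label only: +1 (non-outlier) or -1 (outlier),
# with no notion of direction or magnitude of a drift.
X_test = np.vstack([np.zeros(5),        # at the centre of the training cloud
                    np.full(5, 10.0)])  # far from everything
labels = ocsvm.predict(X_test)
print(labels)  # the distant sample is labelled -1
```

Because the output is only a binary label, such a chart conveys no direction or magnitude of deviation, which is why this method complements rather than replaces the PCA control charts.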
Thus it is not useful for monitoring slow drifts in the output as PCA is, but it can complement such a method by labelling suspicious samples that might hide themselves across multiple dimensions. To investigate the use of osvm on our data, an SVM was trained on the training data. Due to the high dimensionality of the data

\subsection{Missing-label outlier detection}

\subsubsection{Distance based}
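As a minimal sketch of the distance-based idea, a sample can be flagged when its distance to the nearest reference sample exceeds a threshold calibrated on the reference set itself. The data and the 95th-percentile cut-off below are illustrative assumptions, not the procedure used in this study:

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 5))       # invented reference samples
X_test = np.vstack([X_train[0],           # a sample identical to a reference
                    np.full(5, 8.0)])     # an obvious outlier

# Distance from each test sample to its nearest reference sample.
diff = X_test[:, None, :] - X_train[None, :, :]
d_test = np.linalg.norm(diff, axis=2).min(axis=1)

# Calibrate the threshold from the reference set: each reference sample's
# distance to its nearest *other* reference sample.
pair = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=2)
np.fill_diagonal(pair, np.inf)
threshold = np.percentile(pair.min(axis=1), 95)  # illustrative cut-off

outliers = d_test > threshold
print(outliers)  # the distant sample is flagged, the copied one is not
```

Unlike the hard osvm labels, the underlying distances can themselves be charted over time, which makes this style of method usable for drift monitoring as well.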