Analysing quality control metrics from mass spectrometry data files


Mass spectrometry-based proteomics (from here on proteomics) has greatly improved the coverage of protein analyses compared to older gel-based methods. Usually the mass spectrometer is coupled, by means of an electrospray ionisation (ESI) source, to a liquid chromatograph for better separation of the complex samples. This setup (usually referred to as LC-MS/MS), though powerful, is subject to considerable instability. Sample-to-sample, instrument-to-instrument and day-to-day variations are an unfortunate fact of doing proteomics. Despite this, rigorous quality control has yet to become a standardised part of proteomic pipelines. Sample-to-sample variation is usually assessed after data acquisition has ended and, if not too severe, normalisation is employed to compensate for it. Standard practice is to compare only samples from the same run in order to avoid instrument-to-instrument and day-to-day variation in the data. While this approach is statistically sound, it removes the possibility of combining data from multiple sources and of tracking the performance of the equipment over time.

Some effort has been put into this area, mostly centred on defining metrics that can be extracted from raw data files and used to monitor different aspects of the instrumentation. The first iteration of this effort (Rudnick et al., 2010) defined 46 metrics that could be extracted from a raw LC-MS/MS file and used to trace subtle variations back to different parts of the instrumentation. Recently QuaMeter (Ma et al., 2012) refined these and was used to compare data from several laboratories (Wang et al., 2014). In the latter study robust PCA was used to investigate the samples, but apart from this no foray into more advanced computational data analysis algorithms has been attempted. Furthermore the study, while addressing the need for on-line quality control, only investigated the variability of samples in a post-analysis manner.

This paper investigates the use of other algorithms on the same data as employed by Wang et al. (2014), with the aim of finding a data-analytical approach that is well suited for automatic, continuous quality control of LC-MS/MS equipment, for both within-run and between-run monitoring. As such it forgoes the problem of instrument-to-instrument variation, since this has already been addressed and because such comparisons would normally not be done in an automated manner anyway.

Materials and Methods


The data used in this study were provided by Wang et al. (2014) and match those used in their paper. They are a collection of metrics extracted using QuaMeter from samples across a range of US laboratories (Vanderbilt University Medical Center, Pacific Northwest National Laboratory, Broad Institute and Johns Hopkins University). The Velos data from Vanderbilt University Medical Center were used as example data, as they comprised the most samples over a long period of time. All samples were divided into runs by looking at the time difference between each sample and the next: a time difference exceeding 2 hours marked the start of a new run. Using this approach 37 runs were identified in the dataset, with a median size of 15 samples. Two of the runs included only one sample and were subsequently removed. In each run the first, middle and last samples were designated as standards for monitoring between-run variation. A stable instrument period between February 25th and April 15th 2013 was identified, and the samples from that period were used as a training set for between-run variation. The training set thus included 89 samples. For within-run analysis run 6 was chosen (August 31st 2012 ff.) as it comprised 15 samples, including one obvious outlier and a few subtle ones.
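The run segmentation described above can be sketched in a few lines. The sketch below is in Python for illustration only and is not the R code from the appendix. The function names, the made-up timestamps and the choice of `run_length // 2` as the "middle" position (which is ambiguous for runs of even length) are all assumptions.

```python
from datetime import datetime, timedelta


def split_into_runs(timestamps, max_gap=timedelta(hours=2)):
    """Group acquisition times into runs; a gap exceeding max_gap starts a new run."""
    runs = []
    for t in sorted(timestamps):
        if runs and t - runs[-1][-1] <= max_gap:
            runs[-1].append(t)   # gap small enough: same run
        else:
            runs.append([t])     # first sample, or gap too large: new run
    return runs


def standard_indices(run_length):
    """0-based positions of the first, middle and last sample (duplicates dropped)."""
    return sorted({0, run_length // 2, run_length - 1})


# Made-up acquisition times (minutes after an arbitrary start), not the real data.
start = datetime(2000, 1, 1, 8, 0)
times = [start + timedelta(minutes=m) for m in (0, 60, 120, 400, 450, 1000)]

runs = split_into_runs(times)
# Runs holding a single sample are dropped, as in the text.
kept_runs = [run for run in runs if len(run) > 1]
```

For a run of 15 samples `standard_indices(15)` gives positions 0, 7 and 14, i.e. the first, eighth and fifteenth sample.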

Data analysis

All analyses were done in the statistical computing environment R (R Core Team, 2014) using additional packages that are referenced where they are described. The code used for performing the analyses is available in the appendix. The one exception is the calculation of Angle Based Outlier Detection, which was done using ELKI (Achtert et al., 2013) as this software contained the only known implementation.

Results and Discussion

Outlier detection approaches can be classified in many ways. One prevalent way is to treat the task either as a one-class learning problem or as a missing-label learning problem. In the former, a set of good samples is used to define a model for conforming samples; outliers are then samples that deviate from this model. In the latter, all samples are considered to be missing a label (outlier or non-outlier), and prior assumptions about the data are used to build a model that can assign labels to the samples. This division is sound in our case. The one-class learning problem is well suited for run-to-run QC, as this would usually involve having the same sample analysed as part of every run and comparing it to a set of already accepted analyses of that sample. For within-run QC a prior set of acceptable samples is not available, as the samples being analysed are often new. In that case the outliers must be caught by constantly evaluating the current sample set with respect to itself rather than to some external reference.
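The practical difference between the two settings can be illustrated on a single metric. The following Python sketch is a deliberately simple stand-in and is not one of the algorithms evaluated in this paper. The thresholds, the function names and the example values are assumptions, and the median/MAD score is used only because it is a familiar way of judging a set of values against itself.

```python
import statistics


def one_class_flags(reference, new_values, threshold=3.0):
    """One-class setting: model (mean, sd) built from accepted reference samples only."""
    mu = statistics.mean(reference)
    sd = statistics.stdev(reference)
    return [abs(v - mu) / sd > threshold for v in new_values]


def self_referenced_flags(values, threshold=3.5):
    """Missing-label setting: each value is judged against the current set itself.

    Uses a robust median/MAD score; assumes the MAD is non-zero.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


# Hypothetical values of one metric for previously accepted standards ...
accepted_standards = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2]
new_standards = [10.0, 11.5]
between_run = one_class_flags(accepted_standards, new_standards)

# ... and for the samples of a single run with no external reference.
current_run = [5.0, 5.2, 4.9, 5.1, 9.0, 5.0, 4.8]
within_run = self_referenced_flags(current_run)
```

In the first call the model never sees the samples being judged, which matches run-to-run QC on a recurring standard. In the second the reference is recomputed from the run itself, so it can be re-evaluated every time a new sample is added.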

One-class outlier detection


One of the most used tools for multivariate process control is PCA, so this is a natural starting point. While PCA is notoriously sensitive to outliers and robust versions are often preferred, the training set is free of outliers and the choice of algorithm should thus not have a big effect. A quick comparison of NIPALS (Wold, 1966), Bayesian PCA (bpca) (Oba et al., 2003), Probabilistic PCA (ppca) (Roweis, 1998) and Robust PCA (rpca) as implemented in pcaMethods (Stacklies et al., 2007) confirms this, and NIPALS is used from here on. The model reaches a maximum Q2 value at 5 components, where 61% of the variance is explained. Using this model it is possible to create 6 control charts for the test data (one for each component and one for distance to model) (Figure \ref{fig:plainPCA}). The advantage of this approach is that it allows the user to monitor drifts in several different dimensions. The problem is that, while some obvious outliers are visible, more subtle outliers related to combinations of several PCs do not stand out well. It is possible to get a better view using a scatterplot matrix, but that will still only visualise outliers defined by two PCs. Furthermore this comes at the expense of the time dimension, which is paramount to identifying drifts.
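The quantities plotted in such control charts can be sketched as follows. This is a minimal pure-Python illustration and not the pcaMethods code used for the analyses. The data are synthetic, all function names are hypothetical, the distance to model is taken as the plain Euclidean norm of the residual, and the control limits of 3 standard deviations of the training scores are an assumption.

```python
import math
import random


def fit_scaler(rows):
    """Column means and sample standard deviations of the training data."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(m)]
    return means, [s if s > 0 else 1.0 for s in sds]


def scale(rows, means, sds):
    """Centre and scale rows with parameters learned on the training set."""
    return [[(v - mu) / sd for v, mu, sd in zip(r, means, sds)] for r in rows]


def nipals(x, n_components, tol=1e-12, max_iter=500):
    """Extract unit-length loading vectors one at a time, deflating x after each."""
    x = [r[:] for r in x]
    m = len(x[0])
    loadings = []
    for _ in range(n_components):
        # Start from the column with the largest sum of squares.
        start = max(range(m), key=lambda j: sum(r[j] ** 2 for r in x))
        t = [r[start] for r in x]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            p = [sum(ti * r[j] for ti, r in zip(t, x)) / tt for j in range(m)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]
            t_new = [sum(a * b for a, b in zip(r, p)) for r in x]
            change = sum((a - b) ** 2 for a, b in zip(t, t_new))
            t = t_new
            if change <= tol * sum(v * v for v in t):
                break
        # Deflate: remove the variation described by this component.
        x = [[v - ti * pj for v, pj in zip(r, p)] for ti, r in zip(t, x)]
        loadings.append(p)
    return loadings


def chart_values(z, loadings):
    """Per-sample scores (one per component) and Euclidean distance to the model."""
    scores, distances = [], []
    for r in z:
        s = [sum(a * b for a, b in zip(r, p)) for p in loadings]
        fitted = [sum(sk * p[j] for sk, p in zip(s, loadings)) for j in range(len(r))]
        scores.append(s)
        distances.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(r, fitted))))
    return scores, distances


# Synthetic stand-in data: two latent factors driving six hypothetical metrics.
rng = random.Random(1)


def simulate(n):
    rows = []
    for _ in range(n):
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([a + rng.gauss(0, 0.2), a + rng.gauss(0, 0.2),
                     a - b + rng.gauss(0, 0.2), b + rng.gauss(0, 0.2),
                     b + rng.gauss(0, 0.2), rng.gauss(0, 1)])
    return rows


train, test = simulate(89), simulate(20)
means, sds = fit_scaler(train)
loadings = nipals(scale(train, means, sds), n_components=2)
train_scores, train_dist = chart_values(scale(train, means, sds), loadings)
test_scores, test_dist = chart_values(scale(test, means, sds), loadings)

# Assumed control limits: +/- 3 standard deviations of the (zero-mean) training scores.
limits = [3 * math.sqrt(sum(s[k] ** 2 for s in train_scores) / (len(train_scores) - 1))
          for k in range(len(loadings))]
flagged = [i for i, s in enumerate(test_scores)
           if any(abs(v) > lim for v, lim in zip(s, limits))]
```

Plotting each column of `test_scores` and `test_dist` against acquisition order gives one chart per component plus one for distance to model. Because each chart is inspected separately, a sample that is only moderately unusual on several components at once stays within every individual limit, which is the weakness noted above.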