Mass spectrometry-based proteomics (from here on proteomics) has greatly improved the coverage of protein analyses compared to older gel-based methods. Usually the mass spectrometer is coupled, by means of an electrospray ionisation (ESI) source, to a liquid chromatograph for better separation of complex samples. This setup (usually referred to as LC-MS/MS), though powerful, is subject to considerable instability. Sample-to-sample, instrument-to-instrument and day-to-day variations are an unfortunate fact of doing proteomics. Despite this, rigorous quality control has yet to become a standardized part of proteomic pipelines. Sample-to-sample variation is usually assessed after data acquisition has ended and, if not too severe, normalisation is employed to compensate for it. Standard practice is to compare only samples from the same run in order to avoid instrument-to-instrument and day-to-day variation in the data. While this approach is statistically sound, it removes the possibility of combining data from multiple sources and of tracking the performance of the equipment over time.
Some effort has been put into this area, mostly centred on defining metrics that can be extracted from raw data files and used to monitor different aspects of the instrumentation. The first iteration of this effort (Rudnick et al., 2010) defined 46 metrics that could be extracted from a raw LC-MS/MS file and used to trace subtle variations back to different parts of the instrumentation. More recently, QuaMeter (Ma et al., 2012) refined these metrics and was used to compare data from several laboratories (Wang et al., 2014). In the latter study robust PCA was used to investigate the samples, but apart from this no foray into more advanced computational data analysis algorithms has been attempted. Furthermore the study, while addressing the need for on-line quality control, only investigated the variability of samples post-analysis.
This paper investigates the use of other algorithms on the same data as employed by Wang et al. (2014), with the aim of finding a data-analytical approach that is well suited for automatic, continuous quality control of LC-MS/MS equipment for both within- and between-run monitoring. As such it sets aside the problem of instrument-to-instrument variation, since this has already been addressed and because such comparisons would normally not be done in an automated manner anyway.
The data used in this study were provided by Wang et al. (2014) and match those used in their paper. They are a collection of metrics extracted using QuaMeter from samples across a range of US laboratories (Vanderbilt University Medical Center, Pacific Northwest National Laboratory, Broad Institute and Johns Hopkins University). The Velos data from Vanderbilt University Medical Center were used as example data, as they comprised the most samples over a long period of time. All samples were divided into runs by looking at the time difference between each sample and the next; a time difference exceeding 2 hours marked the start of a new run. Using this approach 37 runs were identified in the dataset, with a median size of 15 samples. Two of the runs included only one sample and were subsequently removed. In each run the first, middle and last samples were designated as standards used to monitor between-run variation. A stable instrument period between February 25th and April 15th 2013 was identified, and the samples from that period were used as a training set for between-run variation. The training set thus included 89 samples. For within-run analysis run 6 was chosen (August 31st 2012 ff.) as it comprised 15 samples, including one obvious and a few subtle outlier samples.
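The run-splitting rule described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R code used for the analyses; the function names, the made-up timestamps and the choice of index `len(run) // 2` as the "middle" sample of an even-sized run are all assumptions introduced here.

```python
from datetime import datetime, timedelta


def split_into_runs(timestamps, max_gap=timedelta(hours=2)):
    """Group acquisition times into runs.

    A gap to the previous sample that exceeds max_gap starts a new run.
    """
    runs = []
    for t in sorted(timestamps):
        if runs and t - runs[-1][-1] <= max_gap:
            runs[-1].append(t)   # gap small enough: same run
        else:
            runs.append([t])     # first sample, or gap too large: new run
    return runs


def drop_singletons(runs):
    """Remove runs that contain only one sample."""
    return [run for run in runs if len(run) > 1]


def standard_indices(run):
    """Positions of the first, middle and last sample in a run.

    Assumption: for even-sized runs the middle is taken as len(run) // 2.
    """
    return sorted({0, len(run) // 2, len(run) - 1})


# Made-up acquisition times (minutes after an arbitrary start), not real data.
start = datetime(2020, 1, 1, 8, 0)
times = [start + timedelta(minutes=m)
         for m in (0, 60, 120, 400, 700, 760, 820, 880, 940)]

runs = drop_singletons(split_into_runs(times))
for number, run in enumerate(runs, start=1):
    print(number, len(run), standard_indices(run))
```

With these hypothetical timestamps the isolated sample at 400 minutes forms a run of its own and is discarded, leaving one run of three samples and one of five.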
All analyses were done in the statistical computing environment R (R Core Team, 2014) using additional packages that are referenced where they are described. The code used for performing the analyses is available in the appendix. The one exception is the calculation of Angle-Based Outlier Detection, which was done using ELKI (Achtert et al., 2013) as this software contained the only known implementation.
Outlier detection approaches can be classified in many ways. One prevalent way is to view the task either as a one-class learning problem or as a missing-label learning problem. In the former a set of good samples is used to define a model for conforming samples. Outliers are then samples deviating from this model. In the latter all samples are considered as missing a label (outlier or non-outlier) and prior assumptions abo