Introduction
Continuous technical advances have made mass spectrometry (MS) an
appealing technique in many research fields, including proteomics.
Researchers in this field often rely on bottom-up proteomics, in which
the proteins in a sample are cleaved into peptides using digestion
enzymes, e.g. trypsin, and then analyzed by LC-MS/MS. Different
subfields of proteomics research have emerged, including biomarker
discovery, drug discovery, research on post-translational modifications
(PTMs) such as phosphorylation, immunopeptidomics, quantitative
proteomics, and many more. The capability of MS to rapidly sequence
peptides and proteins and to detect mutations and modifications with
very high sensitivity makes it an appealing analytical tool for use in
a clinical setting.
Combined with quantitative methods, MS-based proteomics has the
potential to redefine diseases at the molecular level and to help
shift current curative medicine towards personalized medicine.
However, current workflows are prone to experimental errors. It is
therefore essential to formally compare candidate techniques when
designing a proteomics workflow. In the laboratory, techniques can be
compared directly by comparing the results they produce. From a
bioinformatics point of view, this is less straightforward. Different
algorithms, whether for peptide identification, quantification, or
other purposes, are usually compared on available experimental
datasets. However, such comparisons may not be fully justified. In a
large-scale study of the Proteomics Identifications Database (PRIDE),
Griss et al. found that on average 75% of the spectra analyzed in an MS
experiment remained unidentified. Here, "unidentified" can mean three
things: incorrectly identified, correctly identified but scoring below
the threshold, or truly unidentified. Hence, relying on public datasets
with unknown proteomes poses challenges when comparing bioinformatic
tools.
Additionally, machine learning (ML) and deep learning (DL) algorithms
are becoming more popular in MS-based proteomics due to advances in
the computational field and the availability of large amounts of
(training) data. As a consequence, these algorithms are now commonly
used in every processing step of mass spectrometry data. For spectral
clustering prior to data analysis, GLEAMS is a recent algorithm that
relies on neural networks. For the identification of spectra, Ionbot
and Casanovo are recent ML and DL applications. Lastly, as part of
post-processing, peptide-spectrum matches (PSMs) are almost always
rescored to increase the number of peptide identifications. Commonly
used ML and DL algorithms for this purpose are Percolator, Prosit,
MS2Rescore and MSBooster. Other applications include, but are not
limited to, the
prediction of MS2 peak intensities from peptide sequences, e.g. using
Prosit, MS2PIP or AlphaPeptDeep, and retention time prediction, e.g.
using AlphaPeptDeep or DeepLC. All of these ML and DL applications
were developed on publicly available datasets of annotated MS2
spectra. Their value in improving the identification of MS2 spectra
and PTMs has been extensively demonstrated in the literature.
In contrast to MS2 spectra, MS1 spectra contain information on
multiple peptides, each with a corresponding isotope distribution. This
requires researchers to extract the isotope distribution from specific
regions of interest before analysis. Little research has been done on
extracting these isotope distributions, resulting in a lack of
standardized MS1 isotope distribution benchmark datasets. In this
work, we aim to develop a workflow to
extract isotope distributions in a PSM-driven manner and to present
the results in a standardized way. Our objective is to create a
database of annotated MS1 isotope distributions and other relevant
features, which can serve as a foundation for developing new ML and DL
applications in the future. To evaluate our workflow, we analyzed the
Universal Proteomics Standard 2 (UPS2) from Sigma-Aldrich with
state-of-the-art software and applied the workflow, presenting the
resulting data as a first MS1 benchmark dataset.
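
To make the idea of PSM-driven isotope extraction more concrete, the
following is a minimal Python sketch and not the workflow developed in
this work. It assumes that a PSM supplies the monoisotopic precursor m/z
and charge, that the MS1 scan is available as a sorted centroided peak
list, and that isotope peaks are spaced by the 13C-12C mass difference
divided by the charge. The function names, the 10 ppm tolerance and the
choice of four isotope peaks are illustrative assumptions.

```python
from bisect import bisect_left

# Mass difference between 13C and 12C in Da, used here as an
# approximation of the spacing between isotope peaks (assumption).
C13_C12_DELTA = 1.0033548


def isotope_mzs(mono_mz, charge, n_peaks=4):
    """Expected m/z of the first n_peaks isotope peaks of a precursor."""
    return [mono_mz + k * C13_C12_DELTA / charge for k in range(n_peaks)]


def extract_isotope_distribution(mzs, intensities, mono_mz, charge,
                                 n_peaks=4, tol_ppm=10.0):
    """Sum the MS1 intensity found within tol_ppm of each expected isotope
    peak and normalise the result to sum to 1.

    mzs must be sorted in ascending order; intensities is aligned to mzs.
    If no signal is found at all, a list of zeros is returned.
    """
    raw = []
    for target in isotope_mzs(mono_mz, charge, n_peaks):
        tol = target * tol_ppm * 1e-6
        i = bisect_left(mzs, target - tol)  # first peak inside the window
        total = 0.0
        while i < len(mzs) and mzs[i] <= target + tol:
            total += intensities[i]
            i += 1
        raw.append(total)
    signal = sum(raw)
    return [x / signal for x in raw] if signal > 0 else raw


# Toy MS1 scan: one doubly charged precursor plus an unrelated peak.
scan_mzs = [500.0000, 500.2500, 500.5017, 501.0034, 501.5050]
scan_int = [100.0, 50.0, 60.0, 30.0, 10.0]
print(extract_isotope_distribution(scan_mzs, scan_int, mono_mz=500.0, charge=2))
```

In practice the MS1 scan would be selected by the retention time of the
PSM, and overlapping isotope envelopes from co-eluting peptides would
need additional handling that this sketch leaves out.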