Discussion

The objective of this research was to establish a systematic workflow and generate a high-quality dataset of MS1 isotope distributions. To avoid the uncertainty inherent in working with unknown proteomes, we used the UPS2 standard kit. Because the kit contains only known proteins, we know which proteins to search for, which increases confidence in the identifications made by the database search algorithm. Additionally, the varying protein concentrations within the kit allow researchers to test the sensitivity of newly developed tools.
The first step was a database search on both UPS2 samples. To obtain high-quality PSMs, we used MSFragger with a reversed target-decoy approach at an FDR of 1%. A similar number of PSMs was identified in both samples, and the peptide and protein identifications agreed closely between them. Further investigation showed that protein concentration is one of the most influential factors for protein identification. Specifically, proteins present at lower concentrations in the UPS2 standard kit had reduced sequence coverage and a lower overall detection probability. While this finding is expected, it is worth emphasizing. When an unknown proteome is used to evaluate different algorithms, the PSMs obtained will be influenced by the concentrations of the peptides and proteins present in the sample. Many other factors also affect the probability of identifying proteins and peptides, such as sample preprocessing and the dynamic range of the LC-MS/MS instrument itself, but concentration remains an important consideration and is well described in the literature.
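To make the target-decoy idea concrete, the following is a minimal, generic sketch of how a score cutoff at a given FDR can be estimated from a ranked list of target and decoy PSMs. It is not the procedure implemented in MSFragger or in our workflow; the function name `score_cutoff_at_fdr`, the `(score, is_decoy)` input format and the simple decoys/targets estimator are assumptions made for illustration only.

```python
def score_cutoff_at_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR stays within max_fdr.

    psms: iterable of (score, is_decoy) tuples, where a higher score is better.
    The FDR at each rank is estimated as (#decoys / #targets) among all PSMs
    scoring at least as well. Returns None if no cutoff satisfies max_fdr.
    Hypothetical helper for illustration; ties in score are not treated specially.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep extending the accepted list while the estimate is acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Toy example: three target PSMs and one decoy PSM.
example_psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
strict_cutoff = score_cutoff_at_fdr(example_psms, max_fdr=0.01)
```

In this toy example, accepting the decoy hit at score 8.0 would push the estimated FDR well above 1%, so only the two best-scoring target PSMs pass the strict cutoff.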
Next, we used a workflow developed in-house to extract MS1 isotope distributions for the PSMs obtained from the MSFragger database search. In total, 138,111 peptide isotope distributions were extracted across both samples, of which 127,646 contained two or more peaks. More MS1 isotope distributions were extracted from sample A11-12042 than from sample A11-12043, consistent with sample A11-12042 having more PSMs. The spectral angle was used to assess the similarity between the experimental isotope distributions and the corresponding theoretical isotope distributions computed by BRAIN. The spectral angle takes values between 0 and 1.57 (approximately π/2), with values closer to 0 indicating a higher similarity between the experimental and theoretical isotope distributions. The bell-shaped distributions of the spectral angle scores in both samples lie close to 0, indicating a high similarity between theoretical and experimental isotope distributions (Figure 3). The dataset still includes isotope distributions with a high spectral angle score, indicating a high dissimilarity between the theoretical and experimental isotope distributions. We opted to leave these in the dataset, as they may still serve as valuable input for training machine learning and deep learning models. The remaining 10,465 isotope distributions consist of only the monoisotopic peak. To our knowledge, there is currently no way of validating these single-peak MS1 isotope distributions. Their only support is that they were extracted at approximately the same retention time as confidently identified PSMs and within the specified mass window. Lastly, the complete MS1 isotope distribution dataset comprises 965 unique peptides, defined by their sequence, modifications and charge state. While the complete dataset is quite large, it is therefore limited to a set of unique UPS peptides. However, we believe that the workflow presented here may be used in the future to extract more MS1 isotope distributions from proteome standards such as the large-scale ProteomeTools dataset.
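The spectral angle used above to compare experimental and theoretical isotope distributions can be illustrated with a short sketch. It assumes the common definition of the spectral angle as the arccosine of the normalized dot product of two intensity vectors of equal length with aligned isotope peaks; the function name `spectral_angle` and the input format are illustrative and are not taken from the published workflow code.

```python
import math


def spectral_angle(experimental, theoretical):
    """Spectral angle (in radians) between two aligned intensity vectors.

    Assumes both vectors list intensities for the same isotope peaks in the
    same order. For non-negative intensities the result lies between 0
    (identical shape) and pi/2, roughly 1.57 (no shared signal).
    """
    if len(experimental) != len(theoretical):
        raise ValueError("intensity vectors must have the same length")
    dot = sum(e * t for e, t in zip(experimental, theoretical))
    norm_e = math.sqrt(sum(e * e for e in experimental))
    norm_t = math.sqrt(sum(t * t for t in theoretical))
    if norm_e == 0 or norm_t == 0:
        raise ValueError("intensity vectors must not be all zero")
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_e * norm_t)))
    return math.acos(cosine)


# Same shape at a different absolute intensity gives an angle close to 0.
same_shape = spectral_angle([100.0, 60.0, 20.0], [0.50, 0.30, 0.10])
# No overlapping signal gives the maximum angle of pi/2.
no_overlap = spectral_angle([1.0, 0.0], [0.0, 1.0])
```

Because the vectors are normalized, the angle depends only on the shape of the isotope distribution and not on its absolute intensity. Under this definition, a distribution consisting of a single peak carries no shape information, which is one way to see why the monoisotopic-only distributions cannot be validated with this score.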
In this manuscript, we presented a data-driven approach to extract high-quality MS1 isotope distributions and to present them in a standardized manner. The proposed workflow can be used in the future to further extend the benchmark dataset. The benchmark dataset itself provides a foundation for the development of new bioinformatics tools, such as new machine learning and deep learning models. These novel algorithms may further advance our understanding of the molecular underpinnings of disease pathology. All code and algorithms have been made available at https://github.com/VilenneFrederique/MS1IsotopeDistributionsDatasetWorkflow

Acknowledgements

This research was funded by Research Foundation – Flanders (FWO) under the “Beyond the Genome: Ethical Aspects of Large Cohort Studies” project (Case number G070722N).

Conflicts of interest

The authors have declared no conflicts of interest.