Introduction

Continuous technical advancements have made mass spectrometry (MS) an appealing technique in different research fields, for example proteomics. In proteomics, researchers often rely on bottom-up proteomics, in which the proteins in a sample are cleaved into peptides using digestion enzymes, e.g. trypsin, followed by LC-MS/MS. Different subfields of research within proteomics have emerged, including biomarker discovery, drug discovery, research on post-translational modifications (PTMs) such as phosphorylation, immunopeptidomics, quantitative proteomics, and many more. The capability of MS to rapidly sequence peptides and proteins, and to detect mutations and modifications with very high sensitivity, makes it an appealing analytical tool to apply within a clinical setting.
Coupled with quantitative proteomics, MS-based proteomics has the potential to redefine disease definitions at the molecular level and help shift current curative medicine towards personalized medicine. However, current workflows are prone to experimental errors. Because of these errors, it is essential to make a formal comparison of different proteomics techniques when creating a proteomics workflow. In the laboratory, different techniques can readily be compared on the results they produce. From a bioinformatics point of view, this is less straightforward. Different algorithms, whether for peptide identification, quantification, or other purposes, are usually compared on available experimental datasets. However, the comparison of algorithms on these experimental datasets may not be truly justified. Griss et al. found in a large-scale study of the Proteomics Identifications Database (PRIDE) that on average 75% of the spectra analyzed in an MS experiment remained unidentified. Unidentified could mean three things: incorrectly identified, correctly identified but below scoring thresholds, or truly unidentified. Hence, relying on public datasets with unknown proteomes poses challenges when comparing different bioinformatic tools.
Additionally, machine learning (ML) and deep learning (DL) algorithms are becoming more popular in MS-based proteomics due to advancements in the computational field and the availability of large amounts of (training) data. As a consequence, these algorithms are now commonly used in every processing step of mass spectrometry data. For spectral clustering prior to data analysis, GLEAMS is a novel algorithm that relies on neural networks. For the identification of spectra, Ionbot and Casanovo are recent machine learning and deep learning applications. Lastly, as a part of post-processing, peptide-spectrum matches (PSMs) are almost always rescored to increase the number of peptide identifications. Commonly used ML and DL algorithms for this purpose are Percolator, Prosit, MS2Rescore and MSBooster. Other applications include, but are not limited to, the prediction of MS2 peak intensities from peptide sequences, e.g. using Prosit, MS2PIP or AlphaPeptDeep, and retention time prediction, e.g. using AlphaPeptDeep or DeepLC. All mentioned ML and DL applications have been developed using publicly available datasets of annotated MS2 spectra. Their use in improving the identification of MS2 spectra and PTMs has been shown extensively in the literature.
In contrast to MS2 spectra, MS1 spectra contain information on multiple peptides, each with a corresponding isotope distribution. This requires researchers to extract the isotope distribution from specific regions of interest before analysis. Little research has been done on extracting these isotope distributions, resulting in a lack of standardized MS1 benchmark datasets of isotope distributions. In this work, we aim to develop a workflow that extracts the isotope distribution in a PSM data-driven manner, and we present the results in a standardized way. Our objective is to create a database of annotated MS1 isotope distributions and other relevant features, which can be used as a foundation to develop new ML and DL applications in the future. To evaluate our workflow, we analyzed the Universal Proteomics Standard 2 (UPS2) from Sigma-Aldrich with state-of-the-art software and applied the workflow, presenting the result as a first MS1 benchmark dataset.
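To make concrete what extracting an isotope distribution from a region of interest can involve, a minimal sketch is given here. It is not the workflow developed in this work, only an illustration under stated assumptions: the MS1 spectrum is available as a sorted list of centroided m/z values with matching intensities, the monoisotopic m/z and charge are taken from a PSM, isotope peaks are assumed to be spaced by the 13C-12C mass difference divided by the charge, and peaks are matched within a ppm tolerance. The function name, its parameters and the default values are hypothetical.

```python
from bisect import bisect_left

# Approximate mass difference between 13C and 12C in Da (assumed isotope spacing).
ISOTOPE_SPACING = 1.0033548


def extract_isotope_envelope(mz_values, intensities, mono_mz, charge,
                             n_peaks=4, tol_ppm=10.0):
    """Return the normalized intensities of the first n_peaks isotope peaks.

    mz_values must be sorted in ascending order and aligned with intensities.
    For each expected isotope position, the most intense peak inside the
    ppm tolerance window is taken; 0.0 is used when no peak is found.
    """
    envelope = []
    for k in range(n_peaks):
        target = mono_mz + k * ISOTOPE_SPACING / charge
        tol = target * tol_ppm * 1e-6
        # Jump to the first peak at or above the lower window bound.
        i = bisect_left(mz_values, target - tol)
        best = 0.0
        while i < len(mz_values) and mz_values[i] <= target + tol:
            best = max(best, intensities[i])
            i += 1
        envelope.append(best)
    total = sum(envelope)
    # Normalize to relative abundances so envelopes are comparable.
    return [x / total for x in envelope] if total > 0 else envelope


# Toy spectrum (made-up values) containing a doubly charged precursor at m/z 500.0.
mz = [500.0, 500.2507, 500.5017, 500.7525, 501.0034, 501.5]
inten = [100.0, 5.0, 60.0, 7.0, 40.0, 1.0]
envelope = extract_isotope_envelope(mz, inten, mono_mz=500.0, charge=2, n_peaks=3)
```

In this toy case the interleaved peaks at 500.2507 and 500.7525 fall outside the tolerance windows and are ignored, which mirrors the situation in which an MS1 spectrum contains signals from several co-eluting peptides.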