Materials and Methods

Data

A publicly available LC-MS/MS experiment using the UPS2-kit was used. The Universal Proteomics Standard 2 (UPS2) contains 48 human proteins with molecular weights ranging from 6,000 to 83,000 Da. The protein amounts span a dynamic range from 0.5 to 50,000 fmol. The data are available in the PRIDE repository under identifier PXD000331. The dataset contains raw data from the UPS2-kit alone, as well as from the UPS2-kit in combination with other organisms such as Mycoplasma pneumoniae, Drosophila melanogaster and Leptospira interrogans. For the purpose of this manuscript, only the raw data from the UPS2-kit alone were selected. In the experiment, the proteins in the UPS2-kit were enzymatically cleaved into peptides using trypsin. The peptide mixture was separated by LC for 120 minutes prior to MS/MS on an LTQ Orbitrap Velos. The UPS2-kit was measured in duplicate (A11-12042.raw and A11-12043.raw). For more specific information about the experiment, we refer to the original article by Ahrné et al.

Database search

Both duplicates were analyzed with the FragPipe graphical user interface (version 19.1), using the Thermo Fisher .RAW files as input. FragPipe incorporates the MSFragger database search engine (version 3.7). The default workflow was used to process the data, with the following adjustments. A precursor mass tolerance of ±10 parts per million (ppm) and a fragment mass tolerance of ±5 ppm were specified. Carbamidomethylation of cysteine was set as a fixed modification and oxidation of methionine as a variable modification. Trypsin was specified as the digestion enzyme with up to 2 missed cleavages. MSBooster and Percolator were used to rescore the peptide-spectrum matches (PSMs), which were filtered at a false discovery rate (FDR) of 0.01 using a reversed target-decoy approach. This threshold was selected to ensure a high-quality benchmark dataset built from the MSFragger identifications. The results were further investigated using R (version 4.3.0) and RStudio (version 2023.3.0.386).
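To make the target-decoy filtering concrete, the following minimal Python sketch illustrates the general idea: decoy sequences are obtained by reversing target sequences, and the FDR at a score cutoff is estimated as the ratio of decoy to target hits above that cutoff. This is not the procedure that Percolator or FragPipe implements internally (Percolator rescoring is semi-supervised and reports q-values); the function names and the simple decoys/targets estimator are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def reverse_decoy(sequence: str) -> str:
    """Build a decoy entry by reversing a target protein sequence."""
    return sequence[::-1]


def filter_at_fdr(psms: List[Tuple[float, bool]], max_fdr: float = 0.01) -> List[float]:
    """Return the scores of target PSMs accepted at the given FDR.

    psms: (score, is_decoy) pairs, where a higher score is better.
    The FDR at each rank is estimated as decoys / targets (illustrative
    estimator, not Percolator's q-value computation).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for rank, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        estimated_fdr = decoys / max(targets, 1)
        if estimated_fdr <= max_fdr:
            cutoff = rank  # keep the deepest rank that still satisfies the FDR
    return [score for score, is_decoy in ranked[:cutoff] if not is_decoy]
```

With a list of scored PSMs, `filter_at_fdr(psms, max_fdr=0.01)` would return only the target scores that survive the 1% threshold.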

Benchmark dataset construction

The general workflow is shown in Figure 1. The Thermo Fisher .RAW files were converted into mzML format using MSConvert (version 3.0.23051). Vendor-specific peak picking was enabled in MSConvert to centroid the spectra. The data were processed further using a custom Python script (Python version 3.9). The Python bindings of OpenMS (version 2.7.0) were used to process the mzML files, for example to select the MS1 spectra and to extract peak information and retention times. All PSMs from MSFragger were used to construct the dataset. Each isotope distribution was limited to the monoisotopic peak followed by up to 5 further isotopic peaks. It should be noted that this was an arbitrary choice. To construct the extracted ion chromatogram (XIC), the error margin on the observed m/z of the PSM was set to 5 ppm, and we opted for a window from 5 seconds before to 30 seconds after the retention time of the PSM. The 5-second window before the retention time was selected because 5 seconds was twice the maximum time between two MS1 spectra. The 30-second window after the retention time was selected because Ahrné et al. enabled a dynamic exclusion of 30 seconds after sampling a precursor ion. Hence, it was possible that the peptide was still present in the following MS1 spectra for 30 seconds without being sampled again. The extracted isotope distributions with at least 2 peaks were compared with the theoretical isotope distributions obtained using BRAIN (version 1.44.0) by computing the spectral angle. The MS1 isotope distribution dataset with additional metadata was stored as an Excel file. The algorithms and code are available at https://github.com/VilenneFrederique/MS1IsotopeDistributionsDatasetWorkflow.
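The extraction parameters described above can be illustrated with a short Python sketch. It computes the 5 ppm m/z tolerance window, the retention time window (5 seconds before, 30 seconds after), the expected m/z positions of the monoisotopic peak and up to 5 further isotopic peaks, and a spectral angle between an observed and a theoretical isotope distribution. The function names, the isotope spacing of about 1.00335 Da (the 13C-12C mass difference) and the definition of the spectral angle as the arccosine of the cosine similarity are assumptions made for illustration; the exact implementation and any normalization of the angle are those of the linked repository, not this sketch.

```python
import math
from typing import List, Sequence, Tuple

# Assumed spacing between isotopic peaks: 13C - 12C mass difference in Da.
ISOTOPE_SPACING_DA = 1.0033548


def ppm_window(mz: float, ppm: float = 5.0) -> Tuple[float, float]:
    """Lower and upper m/z bounds for a symmetric ppm tolerance."""
    delta = mz * ppm * 1e-6
    return mz - delta, mz + delta


def rt_window(rt_s: float, before_s: float = 5.0, after_s: float = 30.0) -> Tuple[float, float]:
    """Retention time window (in seconds) around the PSM retention time."""
    return rt_s - before_s, rt_s + after_s


def isotope_mzs(mono_mz: float, charge: int, n_isotopes: int = 5) -> List[float]:
    """Expected m/z of the monoisotopic peak plus up to n_isotopes further peaks."""
    return [mono_mz + k * ISOTOPE_SPACING_DA / charge for k in range(n_isotopes + 1)]


def spectral_angle(observed: Sequence[float], theoretical: Sequence[float]) -> float:
    """Angle in radians between two intensity vectors (0 means identical shape)."""
    if len(observed) != len(theoretical):
        raise ValueError("intensity vectors must have the same length")
    dot = sum(o * t for o, t in zip(observed, theoretical))
    norm = math.sqrt(sum(o * o for o in observed)) * math.sqrt(sum(t * t for t in theoretical))
    if norm == 0.0:
        raise ValueError("intensity vectors must not be all zero")
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cosine)
```

For a doubly charged precursor observed at m/z 500.0 with a retention time of 100 s, `ppm_window(500.0)`, `rt_window(100.0)` and `isotope_mzs(500.0, 2)` would give the search bounds for each isotopic peak, and `spectral_angle` would then score the extracted intensities against the theoretical distribution truncated to the same number of peaks.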