Abstract
Relative and absolute intensity-based protein quantification across cell lines, tissue atlases, and tumour datasets is increasingly available in public datasets. These atlases enable researchers to explore fundamental biological questions, such as protein existence, expression location, quantity, and correlation with RNA expression. Most studies provide MS1 feature-based label-free quantitative (LFQ) datasets; however, a growing number of isobaric tandem mass tag (TMT) datasets remain unexplored. Here, we compare traditional intensity-based absolute quantification (iBAQ) proteome abundance ranking to an analogous ranking based on reporter ion intensities, using data from an experiment where LFQ and TMT were measured on the same samples. This new TMT method substitutes reporter ion intensities for MS1 feature intensities in the iBAQ framework. Additionally, we compared LFQ-iBAQ values to TMT-iBAQ values from two independent large-scale tissue atlas datasets (one LFQ and one TMT) using robust bottom-up proteomic identification, normalisation, and quantitation workflows.
Proteomics is a powerful tool for understanding the underlying biology of cells and tissues. Large-scale cell line, tumour, and tissue atlas datasets enable researchers to ask fundamental questions about the proteome, such as protein existence, expression location and correlation with RNA expression [1-3]. The number of publicly available datasets continues to expand every year [4], facilitating their reuse [5, 6] and integration into protein expression resources [7, 8]. Label-free intensity-based absolute quantification (iBAQ) is a robust and common method to estimate the expression of proteins without the need for a standard reference sample [9, 10]. This method measures relative protein abundances within a sample, which can be converted to approximate absolute scales, such as copy number, when certain assumptions are met. So far, iBAQ protein expression has only been explored for label-free data-dependent acquisition (DDA) [9] and data-independent acquisition (DIA) methods using MS1 [10].
MS2 methods [11, 12], such as spectral counting, can serve as a proxy for absolute quantification in bottom-up proteomics experiments. Spectral-counting algorithms offer some advantages because they can be applied directly to the data commonly collected for identification purposes, including TMT (multiplex) experiments. In 2011, Colaert et al. [12] explored three MS2-based quantitative methods: Exponentially modified Protein Abundance Index (EmPAI) [13], Normalized Spectral Abundance Factor (NSAF) [14], and normalized Spectral Index (SIn) [15]. Their findings indicated that the NSAF method outperformed both EmPAI and SIn in terms of accuracy and precision [12]. However, spectral counting-based quantification does not use chromatographic peak attributes such as height or area, which potentially limits its accuracy and dynamic range [16, 17]. Ahrné et al. [18] took a distinct intensity-based strategy to calculate iBAQ values in TMT datasets, treating them as label-free datasets. This involved distributing the MS1 intensities of all TMT-labelled features among the individual samples based on the relative reporter ion intensities. However, this approach is more complex, as the datasets need to be analyzed as label-free experiments and precursor ion intensities must be extracted. Furthermore, this approach has not been applied to a large-scale dataset or benchmarked across different datasets.
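The redistribution step described for the strategy of Ahrné et al. [18] can be illustrated with a minimal sketch. It assumes that each channel receives a share of the MS1 feature intensity proportional to its reporter ion intensity; the function name, channel labels, and numbers are hypothetical and are not taken from the original implementation.

```python
def split_ms1_by_reporters(ms1_intensity, reporter_intensities):
    """Distribute one MS1 feature intensity across TMT channels.

    Assumption for illustration: each channel receives a share of the
    MS1 intensity proportional to its reporter ion intensity.
    """
    total = sum(reporter_intensities.values())
    if total <= 0:
        # No reporter signal: nothing can be assigned to any channel.
        return {channel: 0.0 for channel in reporter_intensities}
    return {
        channel: ms1_intensity * value / total
        for channel, value in reporter_intensities.items()
    }


# Hypothetical feature with three channels.
shares = split_ms1_by_reporters(
    1.0e6, {"126": 200.0, "127N": 300.0, "128N": 500.0}
)
```

The per-channel shares always sum to the original MS1 intensity, which is why this strategy requires precursor intensities to be extracted in addition to the reporter ions.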
Here, we explored an alternative approach to perform absolute protein expression analysis on TMT datasets using the reporter ion intensities directly. To assess the accuracy of this method, we employed a gold-standard mixed-proteome dataset (PXD007683) [19] analyzed with both LFQ and TMT methods. We then calculated iBAQ values based on MS1 feature intensities (LFQ) or reporter ion intensities (TMT) and compared the correlation for all quantified proteins. Additionally, we applied robust normalization and quantitation workflows to analyze two large-scale tissue datasets from Jiang et al. (TMT – PXD016999) [1] and Wang et al. (LFQ – PXD010154) [2].
Intensity-based absolute quantification (iBAQ) values were estimated using the MS1 intensities for label-free experiments, and the reporter ion intensities in the case of TMT datasets. Feature intensity tables for all analyzed datasets were generated using the quantms workflow (https://quantms.readthedocs.io/), which enables the analysis of DDA label-free, DIA label-free, and TMT datasets. Each generated feature was the combination of a peptide sequence, modifications, charge state, sample, fraction, and technical or biological replicate. Feature intensities were normalized using quantile normalization, and the highest intensity for each feature was then selected across replicates. Finally, feature intensities were averaged (mean) at the peptide sequence level. The iBAQ value of a protein was computed by dividing the sum of its peptide intensities by the number of theoretically observable peptides of the protein. Each iBAQ value was then normalized to the sum of all iBAQ values for the same sample (riBAQ) [20, 21]. All analysis steps are included in a Python package (https://github.com/bigbio/ibaqpy).
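The aggregation steps above can be sketched for a single sample as follows. This is a minimal illustration and not the ibaqpy implementation: it assumes the feature intensities have already been quantile-normalized, it takes the number of theoretically observable peptides per protein as a given input, and the function name, tuple layout, peptide sequences, and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def ibaq_and_ribaq(features, n_theoretical):
    """Compute iBAQ and riBAQ for one sample.

    features: iterable of (protein, peptide, charge, replicate, intensity),
        assumed to be already quantile-normalized.
    n_theoretical: dict mapping protein -> number of theoretically
        observable peptides (assumed to be computed elsewhere).
    """
    # 1) Keep the highest intensity of each feature across replicates.
    best = {}
    for protein, peptide, charge, _replicate, intensity in features:
        key = (protein, peptide, charge)
        best[key] = max(best.get(key, 0.0), intensity)

    # 2) Average the remaining features at the peptide sequence level.
    per_peptide = defaultdict(list)
    for (protein, peptide, _charge), intensity in best.items():
        per_peptide[(protein, peptide)].append(intensity)

    # 3) Sum peptide intensities per protein.
    summed = defaultdict(float)
    for (protein, _peptide), values in per_peptide.items():
        summed[protein] += mean(values)

    # 4) iBAQ: divide by the number of theoretically observable peptides.
    ibaq = {p: total / n_theoretical[p] for p, total in summed.items()}

    # 5) riBAQ: normalize to the sum of all iBAQ values in the sample.
    ibaq_sum = sum(ibaq.values())
    ribaq = {p: value / ibaq_sum for p, value in ibaq.items()}
    return ibaq, ribaq


# Hypothetical features for two proteins in one sample.
features = [
    ("P1", "LVNELTEFAK", 2, 1, 100.0),
    ("P1", "LVNELTEFAK", 2, 2, 120.0),
    ("P1", "LVNELTEFAK", 3, 1, 80.0),
    ("P1", "AEFVEVTK", 2, 1, 50.0),
    ("P2", "YLYEIAR", 2, 1, 300.0),
]
ibaq, ribaq = ibaq_and_ribaq(features, {"P1": 5, "P2": 10})
```

The same function applies to LFQ and TMT data; only the source of the intensity column differs (MS1 feature intensity or reporter ion intensity).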
We tested the TMT-iBAQ approach using a mixed-proteome dataset comprising both Human and Yeast samples in multiple concentrations [19]. The primary objective of the dataset and the original study was to evaluate the capability of TMT and LFQ approaches to accurately quantify fold changes of 3-, 2-, and 1.5-fold across the entire dataset. All parameters for the reanalysis were annotated using the SDRF file format [22] (Supplementary Note 1). In the present study, we did not explore the differential expression across samples (as originally designed by O’Connell et al. [19]) but compared the expression of the Human proteins when using TMT-iBAQ or LFQ-iBAQ.
In the PXD007683 dataset, we quantified a total of 94,804 peptides and 8,401 proteins. Of these, 33,321 peptides and 6,273 proteins were identified by both the TMT and LFQ approaches, while 18,524 peptides from 392 proteins were quantified only with LFQ and 42,959 peptides from 1,736 proteins only with TMT. The peptide intensities of the two approaches are significantly correlated for all samples (R > 0.44, p-value < 2.2e-16; Supplementary Note 2). The log-scale iBAQ values for the TMT and LFQ approaches were compared, as shown in Figure 1A-B. First, we evaluated the reproducibility of the two methods across all 11 sample replicates (Figure 1A). Samples analysed with the label-free method showed a higher coefficient of variation (average CV = 15%) than TMT samples (average CV = 11%). The iBAQ values displayed a similar distribution across the 11 samples, with a higher median intensity observed for TMT than for LFQ in all samples (Figure 1A). The Pearson correlation between TMT and LFQ iBAQ values is high (R > 0.83, p-value < 2.2e-16). These results demonstrate that the iBAQ values obtained from both LFQ and TMT approaches in this benchmark dataset are highly consistent and reliable. This result is in line with the long-standing use of MS2 data (fragment ion intensities) for quantification in proteomics, for example in MRM and DIA experiments, and with the good correlations reported between precursors and their reporter ions in DDA experiments [23].
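The two summary statistics used in this comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the CV is taken as the sample standard deviation divided by the mean, the correlation is computed on log10-transformed iBAQ values of proteins quantified by both methods, and all function names and numbers are hypothetical rather than taken from the analysis itself.

```python
import math
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation (%) of one protein across replicates."""
    return 100.0 * stdev(values) / mean(values)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def log_ibaq_correlation(lfq, tmt):
    """Correlate log10 iBAQ values of proteins present in both dicts."""
    shared = sorted(set(lfq) & set(tmt))
    x = [math.log10(lfq[p]) for p in shared]
    y = [math.log10(tmt[p]) for p in shared]
    return pearson(x, y), len(shared)


# Hypothetical iBAQ values; P4 and P5 are each seen by one method only.
lfq = {"P1": 1e3, "P2": 1e4, "P3": 1e5, "P4": 1e6}
tmt = {"P1": 1e4, "P2": 1e5, "P3": 1e6, "P5": 1e2}
r, n_shared = log_ibaq_correlation(lfq, tmt)
cv = cv_percent([9.0, 10.0, 11.0])
```

In this toy example the TMT values are a constant multiple of the LFQ values, so the correlation stays perfect even though the TMT intensities are higher, mirroring the observation that a shift in median intensity does not by itself reduce the correlation.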
While previous authors [16, 19, 24] have found that LFQ and TMT methods offer similar accuracy when analysing the same sample, comparisons of these methods for proteome characterization between different studies of similar tissues remain unexplored. We tested this by reanalysing two large-scale human tissue datasets from Jiang et al. (TMT – PXD016999) [1] and Wang et al. (LFQ – PXD010154) [2] (Supplementary Note 1). Both datasets were analysed using the same database (UniProt human Swiss-Prot 092022), the quantms workflow, and the parameters corresponding to each dataset (Supplementary Note 1). For PXD010154, a total of 340,306 peptides and 14,602 proteins were quantified, while for PXD016999 the numbers of quantified peptides and proteins were 173,678 and 10,351, respectively. Figure 2A shows the distribution of iBAQ values for all tissues shared between the two datasets (adrenal gland, liver, lung, ovary, pancreas, prostate, spleen, stomach, and testis); the median intensity is higher for TMT than for LFQ in all tissues except prostate. Figure 2B shows the iBAQ correlation between the two experiments for the shared tissues; all tissues show a correlation coefficient higher than 0.80. The iBAQ values obtained by LFQ and TMT for these 9 tissues were therefore strongly correlated and highly consistent. Previously, Betancourt et al. [25] integrated TMT results with LFQ using the three most abundant peptides for each protein quantified (TOP3), but the reproducibility and the correlation between both technologies were never explored. Using the transformed normalized intensities as suggested by Jiang et al. [1], instead of the iBAQ values from reporter ion intensities (as suggested in this research), could negatively affect the correlation between relative proteome abundances obtained with LFQ or TMT.
In summary, intensity-based absolute quantification (iBAQ), as previously reported, is a robust and common method for estimating the relative/absolute expression of proteins. This study explored and extended the capabilities of the LFQ-iBAQ approach to perform proteome-wide quantification in TMT datasets using direct reporter ion intensities. The results showed that the iBAQ correlation between the TMT and LFQ approaches in different datasets is high, indicating the potential of the direct reporter ion intensity method for relative protein abundance analyses in TMT datasets. This new approach can enable the future integration of public TMT and LFQ proteomics datasets using intensity-based methods instead of less accurate spectral counting, which could improve the accuracy and reproducibility of proteomics meta-analyses.