Abstract
Relative and absolute intensity-based protein quantification across cell
lines, tissue atlases, and tumour datasets is increasingly available in
public datasets. These atlases enable researchers to explore fundamental
biological questions, such as protein existence, expression location,
quantity, and correlation with RNA expression. Most studies provide MS1
feature-based label-free quantitative (LFQ) datasets; however, growing
numbers of isobaric tandem mass tag (TMT) datasets remain unexplored.
Here, we compare traditional intensity-based absolute quantification
(iBAQ) proteome abundance ranking to an analogous method using reporter
ion proteome abundance ranking with data from an experiment where LFQ
and TMT were measured on the same samples. This new TMT method
substitutes reporter ion intensities for MS1 feature intensities in the
iBAQ framework. Additionally, we compared LFQ-iBAQ values to TMT-iBAQ
values from two independent large-scale tissue atlas datasets (one LFQ
and one TMT) using robust bottom-up proteomic identification,
normalisation, and quantitation workflows.
Proteomics is a powerful tool for understanding the underlying biology
of cells and tissues. Large-scale cell lines, tumour datasets, or tissue
atlases enable researchers to ask fundamental questions about the
proteome, such as protein existence, expression location and correlation
with RNA expression [1-3]. The number of publicly available datasets
continues to expand every year [4], facilitating their reuse [5,
6] and integration into protein expression resources [7, 8].
Label-free intensity-based absolute quantification (iBAQ) is a robust
and common method to estimate the expression of proteins without the
need for a standard reference sample [9, 10]. This method measures
relative protein abundances within a sample and can be converted to
approximate absolute scales, such as copy number, when certain
assumptions are met. To date, iBAQ protein expression has only been
explored for label-free data-dependent acquisition (DDA) [9] and
data-independent acquisition (DIA) methods using MS1 [10].
MS2 methods [11, 12], such as spectral counting, can serve as a
proxy for absolute quantification in bottom-up proteomics experiments.
Spectral-counting algorithms offer some advantages because they can be
applied directly to the data commonly collected for identification
purposes, including TMT (multiplex) experiments. In 2011, Colaert et al.
[12] explored three MS2-based quantitative methods: Exponentially
modified Protein Abundance Index (EmPAI) [13], Normalized Spectral
Abundance Factor (NSAF) [14], and normalized Spectral Index (SIn)
[15]. Their findings indicated that the NSAF method outperformed
both EmPAI and SIn in terms of accuracy and precision [12]. However,
spectral counting-based quantification has limitations because it does
not use chromatographic peak attributes such as height or area,
potentially limiting its accuracy and dynamic range [16, 17]. Ahrné
et al. [18] undertook a distinct intensity-based strategy to
calculate iBAQ values in TMT datasets, treating them as label-free
datasets. This involved distributing MS1 intensities of all TMT-labelled
features among the individual samples based on the relative reporter ion
intensities. However, this approach is more complex, as the datasets
need to be analyzed as label-free experiments and precursor ion
intensities must be extracted. Furthermore, this approach has not been
applied to a large-scale dataset or benchmarked across different
datasets.
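The redistribution step described for the MS1-based strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Ahrné et al. [18]: the function name `split_precursor_intensity`, the channel labels, and the example intensities are hypothetical, and each channel is assumed to receive a share of the MS1 feature intensity proportional to its reporter ion intensity.

```python
def split_precursor_intensity(ms1_intensity, reporter_intensities):
    """Split one MS1 feature intensity across TMT channels.

    Each channel receives a share proportional to its reporter ion
    intensity (an assumed reading of the strategy described above).
    """
    total = sum(reporter_intensities.values())
    if total <= 0:
        # No reporter signal: nothing can be apportioned.
        return {channel: 0.0 for channel in reporter_intensities}
    return {
        channel: ms1_intensity * value / total
        for channel, value in reporter_intensities.items()
    }


# Hypothetical feature: one MS1 intensity and three reporter channels.
shares = split_precursor_intensity(
    1.0e6, {"126": 2.0, "127N": 1.0, "127C": 1.0}
)
```

The sketch shows why this route is heavier than using reporter ions directly: it needs both the extracted precursor intensity and the reporter intensities for every feature.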
Here, we explored an alternative approach to perform absolute protein
expression analysis on TMT datasets using the direct reporter ion
intensities. To assess the accuracy of this method, we employed a
gold-standard mix-proteome dataset (PXD007683) [19] analyzed with
both LFQ and TMT methods. We then calculated iBAQ values based on either
MS1 feature or reporter ion intensities (respectively) and compared the
correlation for all quantified proteins. Additionally, we applied robust
normalization and quantitation workflows to analyze two large-scale
tissue datasets from Jiang et al. (TMT – PXD016999) [1] and Wang et
al. (LFQ – PXD010154) [2].
Intensity-based absolute quantification (iBAQ) values were estimated
using the MS1 intensities for label-free experiments, and the reporter
ion intensities in the case of TMT datasets. Feature intensity tables
for all analyzed datasets were generated using the quantms
(https://quantms.readthedocs.io/) workflow, which enables the
analysis of DDA, DIA label-free, and TMT datasets. Each generated
feature was the combination of a peptide sequence, modifications, charge
state, sample, fraction, and technical or biological replicate. Feature
intensities were normalized using quantile normalization, and the highest
intensity for each feature was selected across replicates. Finally,
feature intensities were averaged (mean) at the peptide sequence level.
iBAQ is computed by dividing the sum of peptide intensities by the
number of theoretically observable peptides of the protein. Each iBAQ
value was normalized to the sum of all iBAQ values for the same sample
(riBAQ) [20, 21]. All analysis steps are included in a Python
package (https://github.com/bigbio/ibaqpy).
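The aggregation steps above can be sketched in a few lines of Python. This is a simplified illustration and not the ibaqpy code: the names `ibaq_per_sample` and `ribaq`, the tuple layout of the input, and the toy values are assumptions. Quantile normalization is omitted, and the number of theoretically observable peptides per protein is assumed to be supplied by the caller (for example, from an in silico digest).

```python
from collections import defaultdict
from statistics import mean


def ibaq_per_sample(features, n_observable):
    """Compute iBAQ values for ONE sample.

    features: iterable of (protein, peptide, charge, replicate, intensity),
              taken from MS1 features (LFQ) or reporter ions (TMT).
    n_observable: protein -> number of theoretically observable peptides.
    """
    # 1. Keep the highest intensity of each feature across replicates.
    best = {}
    for protein, peptide, charge, _replicate, intensity in features:
        key = (protein, peptide, charge)
        best[key] = max(best.get(key, 0.0), intensity)

    # 2. Average feature intensities at the peptide sequence level.
    by_peptide = defaultdict(list)
    for (protein, peptide, _charge), intensity in best.items():
        by_peptide[(protein, peptide)].append(intensity)

    # 3. Sum peptide intensities per protein, then divide by the
    #    number of theoretically observable peptides.
    sums = defaultdict(float)
    for (protein, _peptide), values in by_peptide.items():
        sums[protein] += mean(values)
    return {p: total / n_observable[p] for p, total in sums.items()}


def ribaq(ibaq_values):
    """Normalize each iBAQ value to the sum of all iBAQ values (riBAQ)."""
    total = sum(ibaq_values.values())
    return {p: value / total for p, value in ibaq_values.items()}


# Toy input (hypothetical proteins, peptides, and intensities).
features = [
    ("P1", "AAK", 2, 1, 100.0),
    ("P1", "AAK", 2, 2, 120.0),
    ("P1", "AAK", 3, 1, 80.0),
    ("P1", "CCR", 2, 1, 50.0),
    ("P2", "DDK", 2, 1, 30.0),
]
ibaq_values = ibaq_per_sample(features, {"P1": 5, "P2": 3})
ribaq_values = ribaq(ibaq_values)
```

The same function serves both acquisition types; only the source of the `intensity` column changes, which is the substitution this study proposes.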
We tested the TMT-iBAQ approach using a mix-proteome dataset comprising
both Human and Yeast samples in multiple concentrations [19]. The
primary objective of the dataset and the original study was to evaluate
the capability of TMT and LFQ approaches in accurately quantifying fold
changes of 3-, 2-, and 1.5-fold across the entire dataset. All
parameters for the reanalysis were annotated using the SDRF file format
[22] (Supplementary Note 1). In the present study, we did
not explore the differential expression across samples (as originally
designed by O’Connell et al. [19]) but compared the expression of
the Human proteins when using TMT-iBAQ or LFQ-iBAQ.
In the PXD007683 dataset, we quantified a total of 94,804 peptides and
8,401 proteins. There were 33,321 peptides and 6,273 proteins commonly
identified using both the TMT and LFQ approaches, while 18,524 peptides
from 392 proteins and 42,959 peptides from 1,736 proteins were
quantified using only the LFQ or TMT approach, respectively. Peptide
intensities from the two approaches are significantly correlated in all
samples (R > 0.44, p-value < 2.2e-16; Supplementary Note 2). The log-scale iBAQ values for both TMT
and LFQ approaches of the PXD007683 dataset were compared, as shown in
Figure 1A-B. First, we evaluated the reproducibility of the two methods
across all 11 sample replicates (Figure 1A). Samples analysed with the
label-free method showed a higher coefficient of variation (average
CV = 15%), while TMT samples had an average CV = 11%. The iBAQ values
displayed a similar distribution across the 11 samples, with a higher
median intensity observed for TMT experiments than for LFQ in all
samples (Figure 1A). The Pearson correlation between the TMT and LFQ
iBAQ values is high (R > 0.83, p-value < 2.2e-16). These results
indicate that the iBAQ values obtained from the LFQ and TMT
approaches in this benchmark dataset are highly consistent.
This result is in line with the long-standing use of MS2 data
(fragment ion intensities) for quantification in proteomics, for
example in MRM and DIA experiments, and with the good correlations
reported between precursors and their reporters in DDA experiments [23].
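The two summary statistics used in this comparison can be illustrated with a short sketch. It assumes that the CV is the sample standard deviation divided by the mean across replicates and that the Pearson correlation is computed on log10 iBAQ values of proteins quantified by both approaches; the function names and toy values are hypothetical and are not taken from the analysis code.

```python
import math
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return 100.0 * stdev(values) / mean(values)


def pearson_log10(x_by_protein, y_by_protein):
    """Pearson R of log10 values over proteins present in both mappings."""
    shared = sorted(set(x_by_protein) & set(y_by_protein))
    xs = [math.log10(x_by_protein[p]) for p in shared]
    ys = [math.log10(y_by_protein[p]) for p in shared]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values: one protein measured in three replicates.
cv = coefficient_of_variation([9.0, 10.0, 11.0])

# Toy iBAQ values: TMT is a constant factor above LFQ for shared proteins,
# so the log-scale correlation is perfect even though the medians differ.
lfq = {"A": 10.0, "B": 100.0, "C": 1000.0, "D": 5.0}
tmt = {"A": 100.0, "B": 1000.0, "C": 10000.0, "E": 7.0}
r = pearson_log10(lfq, tmt)
```

The toy example also shows that a systematic offset in intensity scale, such as the higher TMT medians reported here, does not by itself reduce the correlation of abundance rankings.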
While previous authors [16, 19, 24] have found that LFQ and TMT
methods offer similar performance in terms of accuracy when analysing
the same sample, comparisons of these methods for proteome
characterization between different studies of similar tissues remain
unexplored. We tested this by reanalysing two large-scale human tissue
datasets from Jiang et al. (TMT – PXD016999) [1] and Wang et al.
(LFQ – PXD010154) [2] (Supplementary Note 1). Both
datasets were analysed using the same database (UniProt human Swiss-Prot
092022), the quantms workflow, and the corresponding dataset parameters
(Supplementary Note 1). For PXD010154, a total of
340,306 peptides and 14,602 proteins were quantified, while
173,678 peptides and 10,351 proteins were quantified for
PXD016999. Figure 2A shows the distribution of iBAQ values
for all shared tissues between both datasets (adrenal gland, liver,
lung, ovary, pancreas, prostate, spleen, stomach, and testis); the
median intensity is higher for TMT experiments than for LFQ in all
tissues except prostate. Figure 2B shows the iBAQ correlation between
both experiments for the shared tissues; all tissues show a
correlation coefficient higher than 0.80. The iBAQ values obtained by
LFQ and TMT for these 9 tissues were thus strongly correlated and
highly consistent. Previously, Betancourt et al. [25] integrated
TMT results with LFQ using the three most abundant peptides for each
protein quantified (TOP3), but the reproducibility and the correlation
between both technologies were not explored. Using the transformed
normalized intensities as suggested by Jiang et al. [1], instead of
the iBAQ values from reporter ion intensities (as suggested in this
study), could negatively affect the correlation between relative
proteome abundances obtained with LFQ or TMT.
In summary, intensity-based absolute quantification (iBAQ), as
previously reported, is a robust and common method for estimating the
relative/absolute expression of proteins. This study explored and
extended the capabilities of the LFQ-iBAQ approach to perform
proteome-wide quantification in TMT datasets using direct reporter ion
intensities. The results showed that the iBAQ correlation between the
TMT and LFQ approaches in different datasets is high, indicating the
potential of the direct reporter ion intensity method for relative
protein abundance analyses in TMT datasets. This new approach can enable
the future integration of public TMT and LFQ proteomics datasets using
intensity-based methods instead of less accurate spectral counting,
which could improve the accuracy and reproducibility of proteomics
meta-analyses.