Predicting Peptide-MHC Binding Affinities With Imputed Training Data


Predicting the binding affinity between MHC proteins and their peptide ligands is a key problem in computational immunology. State-of-the-art performance is currently achieved by the allele-specific predictor NetMHC and the pan-allele predictor NetMHCpan, both of which are ensembles of shallow neural networks. We explore an intermediate between allele-specific and pan-allele prediction: training allele-specific predictors with synthetic samples generated by imputation of the peptide-MHC affinity matrix. We find that the imputation strategy is useful on alleles with very little training data. We have implemented our predictor as an open-source software package called MHCflurry and show that MHCflurry achieves performance competitive with NetMHC and NetMHCpan.


In most vertebrates, cytotoxic T-cells enforce multi-cellular order by killing infected or cancerous cells. Each organism possesses a poly-clonal army of T-cells which collectively are able to distinguish unhealthy cells from healthy ones. This amazing feat is achieved through the winnowing and expansion of clonal T-cell populations possessing highly specific T-cell receptors (TCRs) (Blackman 1990). Each distinct TCR recognizes a small number of similar peptides bound to an MHC molecule on the surface of a cell (Huseby 2005). Though there are many steps in “antigen processing” (Cresswell 2005), it has become apparent that MHC binding is the most restrictive step. Peptide-MHC affinity prediction is the well-studied problem of predicting the binding strength of a given peptide and MHC pair (Lundegaard 2007). Early approaches focused on “sequence motifs” (Sette 1989), followed by regularized linear models, linear models with interaction terms such as SMM with pairwise features (Peters 2003), and more recently the NetMHC family of predictors, a collection of related models based on ensembles of neural networks. Two of these predictors, NetMHC (Lundegaard 2008) and NetMHCpan (Nielsen 2007), have emerged as the methods of choice across multiple fields of study within immunology, including virology (Lund 2011), tumor immunology (Gubin 2015), and autoimmunity (Abreu 2012).

NetMHC is an allele-specific method which trains a separate predictor for each allele’s binding dataset, whereas NetMHCpan is a pan-allele method whose inputs are vector encodings of both a peptide and a subsequence of a particular MHC molecule. The conventional wisdom is that NetMHC performs better on alleles with many assayed ligands whereas NetMHCpan is superior for less well-characterized alleles (Gfeller 2016).

In this paper we explore the space between allele-specific and pan-allele prediction by imputing the unobserved values of peptide-MHC affinities for which we have no measurements and using these imputed values for pre-training of allele-specific binding predictors.

Data and evaluation metrics

We used two datasets from a recent paper studying the relationship between training data and pMHC predictor accuracy (Kim 2014). The training dataset (BD2009) contained entries from IEDB (Salimi 2012) up to 2009, and the test dataset (BLIND) contained IEDB entries from between 2010 and 2013 which did not overlap with BD2009 (see the table of dataset sizes below).

Train (BD2009) and test (BLIND) dataset sizes.

Dataset   Alleles   IC50 Measurements   Expanded 9mers
BD2009    106       137,654             470,170
BLIND     53        27,680              83,752


Throughout this paper we will evaluate a pMHC binding predictor using three different metrics:

  • AUC: Area under the ROC curve. Estimates the probability that a “strong binder” peptide (affinity \(\leq 500\)nM) will be given a stronger predicted affinity than one whose ground truth affinity is \(>500\)nM.

  • F\(_1\) score: The harmonic mean of precision and recall (sensitivity) for predicting “strong binders” with affinities \(\leq 500\)nM.

  • Kendall’s \(\tau\): Rank correlation across the full spectrum of binding affinities.
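
The three metrics can be illustrated with a small pure-Python sketch. This is not the evaluation code used for the paper: the function names are hypothetical, ties in the AUC are counted as half a win, and the rank correlation shown is the simple tau-a variant (no tie correction), both of which are assumptions made to keep the example short.

```python
THRESHOLD_NM = 500.0  # "strong binder" cutoff from the text


def auc(true_ic50, pred_ic50, threshold=THRESHOLD_NM):
    """Fraction of (binder, non-binder) pairs where the binder gets the lower predicted IC50."""
    pos = [p for t, p in zip(true_ic50, pred_ic50) if t <= threshold]
    neg = [p for t, p in zip(true_ic50, pred_ic50) if t > threshold]
    if not pos or not neg:
        raise ValueError("AUC needs at least one binder and one non-binder")
    # A tie between a binder and a non-binder counts as half a win (assumption).
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(true_ic50, pred_ic50, threshold=THRESHOLD_NM):
    """Harmonic mean of precision and recall for the <= threshold class."""
    tp = fp = fn = 0
    for t, p in zip(true_ic50, pred_ic50):
        actual, predicted = t <= threshold, p <= threshold
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)


true_nm = [10, 100, 400, 1000, 5000, 20000]   # made-up measured IC50 values
pred_nm = [20, 600, 300, 800, 400, 30000]     # made-up predicted IC50 values
print(auc(true_nm, pred_nm), f1_score(true_nm, pred_nm), kendall_tau(true_nm, pred_nm))
```

The pairwise loops are quadratic in the number of samples, which is fine for illustration but too slow for datasets of the size described above.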

Comparison of imputation algorithms as predictors

A dataset of peptide-MHC affinities for \(n\) peptides and \(a\) alleles may be thought of as an \(n \times a\) matrix in which peptide/allele pairs without measurements are missing values. The task of predicting values at these positions is known as matrix completion or imputation (depending on the community and data source). We investigated the performance of several imputation algorithms as a standalone solution to the peptide-MHC affinity prediction problem. The algorithms considered were:

  • meanFill: Replace each missing pMHC binding affinity with the mean affinity for that allele. This is a very simple imputation method which serves as a baseline against which other methods can be compared.

  • knnImpute (Troyanskaya 2001): Each missing entry \(X_{ij}\) is imputed using the values in the \(k\) closest columns that have an observation in row \(i\). Similarity between alleles is computed as \(e^{-d_{st}^2}\), where \(d_{st}\) is the mean squared difference between observed entries of alleles \(s\) and \(t\).

  • svdImpute (Troyanskaya 2001): Imputation using an iterative fixed-rank SVD.

  • softImpute (Mazumder 2010): A singular value thresholding method which iteratively estimates a low-rank matrix completion without forcing the pre-specification of a particular solution rank. Instead, the softImpute method is parameterized by a shrinkage value \(\lambda\) that is subtracted from each singular value.

  • MICE (Azur 2011): Average multiple imputations generated using Gibbs sampling from the joint distribution of columns.
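
To make the two simplest methods concrete, here is a minimal pure-Python sketch of meanFill and a knnImpute-style fill on a toy peptide-by-allele matrix. It is not the fancyimpute implementation used in the experiments. Missing values are represented as `None`, the function names are hypothetical, and combining the \(k\) neighbours by a similarity-weighted average (with a fall-back to the allele mean when no neighbour is available) is an assumption, since the description above does not specify how neighbour values are combined.

```python
import math


def column_means(X):
    """Mean of the observed entries in each column (allele)."""
    means = []
    for j in range(len(X[0])):
        observed = [row[j] for row in X if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else None)
    return means


def mean_fill(X):
    """meanFill: replace each missing entry with its allele's mean."""
    means = column_means(X)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in X]


def similarity(X, s, t):
    """exp(-d^2), with d the mean squared difference over rows observed in both columns."""
    diffs = [(row[s] - row[t]) ** 2 for row in X
             if row[s] is not None and row[t] is not None]
    if not diffs:
        return 0.0  # no shared peptides: treat the alleles as unrelated
    d = sum(diffs) / len(diffs)
    return math.exp(-d ** 2)


def knn_impute(X, k=3):
    """Fill X[i][j] from the k most similar columns that are observed in row i."""
    means = column_means(X)
    filled = [list(row) for row in X]
    for i, row in enumerate(X):
        for j in range(len(row)):
            if row[j] is not None:
                continue
            candidates = [(similarity(X, j, t), row[t])
                          for t in range(len(row)) if t != j and row[t] is not None]
            candidates = [c for c in candidates if c[0] > 0.0]
            candidates.sort(key=lambda c: c[0], reverse=True)
            nearest = candidates[:k]
            total = sum(w for w, _ in nearest)
            if total > 0.0:
                # Similarity-weighted average of neighbour values (assumption).
                filled[i][j] = sum(w * v for w, v in nearest) / total
            else:
                filled[i][j] = means[j]
    return filled


# Toy matrix: rows are peptides, columns are alleles, values are scaled affinities.
X = [
    [0.8, 0.7, None],
    [0.2, 0.3, 0.1],
    [None, 0.6, 0.5],
    [0.4, None, 0.3],
]
print(mean_fill(X)[0][2], knn_impute(X, k=1)[0][2])
```

In the toy matrix the third allele tracks the first more closely than the second, so with \(k = 1\) the missing entry in the first row is copied from the first allele.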

We evaluated the performance of these methods using three-fold cross validation on BD2009, only considering peptides which occurred in at least three alleles and excluding alleles with fewer than five measurements (see the table of cross-validation results below). All imputation methods were implemented in the fancyimpute Python library (Feldman 2016). Since MICE outperformed the other methods on two of the three predictor metrics (AUC and Kendall’s \(\tau\)), we selected it for the subsequent neural network experiments.

Cross-validation performance of imputation algorithms on BD2009 dataset

Imputation Method   Parameter         AUC       \(F_1\) score   Kendall’s \(\tau\)
meanFill                              0.67665   0.04950         0.17675
knnImpute           \(k = 1\)         0.80907   0.57952         0.40201
knnImpute           \(k = 3\)         0.83189   0.57594         0.42086
knnImpute           \(k = 5\)         0.83103   0.56118         0.41703
MICE                \(n = 25\)        0.85861   0.57597         0.44978
MICE                \(n = 50\)        0.86127   0.56527         0.45944
softImpute          \(\lambda=5\)     0.78981   0.39158         0.33408
softImpute          \(\lambda=10\)    0.83248   0.53575         0.39763
softImpute          \(\lambda=20\)    0.85608   0.60599         0.43754
svdImpute           rank = 5          0.82305   0.57040         0.39117
svdImpute           rank = 10         0.83667   0.58433         0.40048
svdImpute           rank = 20         0.82986   0.57038         0.38817


Neural network architecture

Each MHCflurry predictor is a feed-forward neural network containing (1) an embedding layer which transforms amino acids to learned vector representations, (2) a single hidden layer with \(\tanh\) nonlinearity, and (3) a sigmoidal scalar output. This network is implemented using Keras (Chollet 2015).

Three-fold cross validation on the training set was used to select the hyper-parameters. The best model had 32 output dimensions for the amino acid vector embedding, a hidden layer size of 64, a dropout rate of 50%, and 250 training epochs. These hyper-parameters achieved reasonable performance across alleles, but it’s likely that performance could be further improved by setting the hyper-parameters separately for each allele.
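
The shape of the computation can be sketched as a forward pass in plain Python. This is an illustration only, not the Keras model itself: the weights are random rather than trained, the 21-letter alphabet (20 amino acids plus the “X” sentinel) and its ordering are assumptions, and dropout is left out because it only applies during training. The layer sizes follow the selected hyper-parameters.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids plus "X" (assumed ordering)
PEPTIDE_LEN = 9
EMBED_DIM = 32   # embedding output dimensions
HIDDEN = 64      # hidden layer size


def init_params(seed=0):
    """Random, untrained parameters with the shapes described in the text."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "embedding": matrix(len(ALPHABET), EMBED_DIM),
        "w_hidden": matrix(PEPTIDE_LEN * EMBED_DIM, HIDDEN),
        "b_hidden": [0.0] * HIDDEN,
        "w_out": [rng.gauss(0.0, 0.1) for _ in range(HIDDEN)],
        "b_out": 0.0,
    }


def forward(params, peptide):
    """Map a 9mer to a score in (0, 1)."""
    if len(peptide) != PEPTIDE_LEN:
        raise ValueError("expected a 9mer")
    # (1) Embedding: look up one vector per residue and concatenate them.
    x = []
    for residue in peptide:
        x.extend(params["embedding"][ALPHABET.index(residue)])
    # (2) Single hidden layer with tanh nonlinearity.
    hidden = []
    for h in range(HIDDEN):
        z = params["b_hidden"][h] + sum(
            xi * row[h] for xi, row in zip(x, params["w_hidden"]))
        hidden.append(math.tanh(z))
    # (3) Sigmoidal scalar output.
    z_out = params["b_out"] + sum(hi * wi for hi, wi in zip(hidden, params["w_out"]))
    return 1.0 / (1.0 + math.exp(-z_out))


params = init_params()
print(forward(params, "SIINFEKLX"))
```

With these sizes the hidden layer sees a 288-dimensional input (9 positions times 32 embedding dimensions).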

Data encoding

Like the NetMHC family of predictors (Lundegaard 2008), MHCflurry uses fixed-length 9mer inputs, which requires peptides of other lengths to be transformed into multiple 9mers. Shorter peptides are mapped onto 9mer query peptides by introducing a sentinel “X” at every possible position, whereas longer peptides are cut into multiple 9mers by removing consecutive stretches of residues. The predicted affinity for a non-9mer peptide is the geometric mean of the predictions for the generated 9mers. When \(n\) training samples derive from a single non-9mer sequence, each of them is given a weight of \(1/n\).
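
The expansion can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than statements about the MHCflurry code: peptides shorter than eight residues are padded with a single contiguous run of “X”, and duplicate 9mers (which can arise when adjacent residues are identical) are kept rather than merged.

```python
import math


def expand_to_9mers(peptide, pad="X"):
    """Turn a peptide of any length into a list of 9mers."""
    n = len(peptide)
    if n == 9:
        return [peptide]
    if n < 9:
        # Insert the sentinel run at every possible position.
        gap = pad * (9 - n)
        return [peptide[:i] + gap + peptide[i:] for i in range(n + 1)]
    # Remove one consecutive stretch of (n - 9) residues at every possible position.
    extra = n - 9
    return [peptide[:i] + peptide[i + extra:] for i in range(n - extra + 1)]


def predict_ic50(peptide, predict_9mer):
    """Geometric mean of the 9mer predictions; predict_9mer returns an IC50 in nM."""
    ninemers = expand_to_9mers(peptide)
    log_mean = sum(math.log(predict_9mer(p)) for p in ninemers) / len(ninemers)
    return math.exp(log_mean)


def training_samples(peptide):
    """Each derived 9mer carries weight 1/n, so one measurement has total weight 1."""
    ninemers = expand_to_9mers(peptide)
    return [(p, 1.0 / len(ninemers)) for p in ninemers]


print(expand_to_9mers("ACDEFGHI"))    # 8mer: nine 9mers containing one "X"
print(expand_to_9mers("ACDEFGHIKL"))  # 10mer: ten 9mers with one residue removed
```

Under this scheme an 8mer yields nine 9mers, while peptides longer than nine residues always yield ten.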

We map IC50 concentrations onto regression targets between 0.0 and 1.0 using the same scheme as NetMHC, \(y = 1.0 - \min(1.0, \log_{50000}(\mathrm{IC50}))\), so that an IC50 of 50,000nM or weaker maps to 0.0 and an IC50 of 1nM maps to 1.0.
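
A small sketch of this transform and its inverse is given below. It assumes the clipping is a minimum, which is what keeps the targets inside the stated 0.0 to 1.0 range. The extra clamp for IC50 values below 1nM and the function names are assumptions added for the example.

```python
import math

MAX_IC50 = 50000.0  # nM; affinities at or beyond this map to a target of 0.0


def ic50_to_target(ic50_nm):
    """Map an IC50 in nM to a regression target in [0, 1]."""
    y = 1.0 - min(1.0, math.log(ic50_nm) / math.log(MAX_IC50))
    return min(1.0, y)  # clamp values below 1 nM (assumption)


def target_to_ic50(y):
    """Inverse mapping from a target in [0, 1] back to nM."""
    return MAX_IC50 ** (1.0 - y)


for ic50 in (1.0, 50.0, 500.0, 50000.0):
    print(ic50, round(ic50_to_target(ic50), 4))
```

On this scale the 500nM “strong binder” threshold corresponds to a target of roughly 0.426.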


For each allele, we train an MHCflurry model using the measured peptide affinities for the allele and the values imputed by MICE based on other alleles in the training set. As training progresses, we place quadratically decreasing weight on the imputed values.
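
One possible form of such a schedule is sketched below. The exact decay used in MHCflurry is not given here, so the formula \((1 - t/T)^2\), the default of 250 epochs (taken from the selected hyper-parameters), and the function names are all assumptions chosen purely for illustration.

```python
def imputed_weight(epoch, n_epochs=250):
    """Hypothetical quadratic decay: 1.0 at epoch 0, falling to 0.0 at the last epoch."""
    remaining = 1.0 - epoch / n_epochs
    return max(0.0, remaining) ** 2


def sample_weights(n_measured, n_imputed, epoch, n_epochs=250):
    """Measured samples keep weight 1.0; imputed samples share the decayed weight."""
    return [1.0] * n_measured + [imputed_weight(epoch, n_epochs)] * n_imputed


for epoch in (0, 50, 125, 250):
    print(epoch, imputed_weight(epoch))
```

Under this assumed schedule the imputed values dominate early training and contribute nothing by the final epoch, so the model is effectively pre-trained on imputed data and then refined on measurements.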

A randomly generated peptide is unlikely to bind a given MHC strongly, but a data acquisition bias toward strong binders in the training set can lead models to assign a high affinity to most peptides. As a form of regularization, we augment the training set at each epoch to include random peptides with affinity set to be maximally weak. The number of random negative peptides is 20% of the training size (without imputation). At each training epoch, a fresh set of random peptides is generated.
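
The per-epoch augmentation can be sketched as follows. The uniform sampling over the 20 amino acids, the use of a regression target of 0.0 for “maximally weak”, and the function names are assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_negatives(n_train, rng, fraction=0.2, length=9, weak_target=0.0):
    """Random 9mers labelled as maximally weak binders (target 0.0 is an assumption)."""
    n_random = int(round(fraction * n_train))
    return [
        ("".join(rng.choice(AMINO_ACIDS) for _ in range(length)), weak_target)
        for _ in range(n_random)
    ]


def epoch_training_set(measured, rng):
    """Measured (peptide, target) pairs plus a fresh batch of random negatives."""
    return list(measured) + random_negatives(len(measured), rng)


rng = random.Random(0)
measured = [("SIINFEKLA", 0.8)] * 50        # made-up training pairs
epoch_1 = epoch_training_set(measured, rng)
epoch_2 = epoch_training_set(measured, rng)  # same measurements, new negatives
print(len(epoch_1), epoch_1[-1], epoch_2[-1])
```

Because the negatives are redrawn every epoch, the network never sees the same random peptide often enough to memorize it.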


We evaluated the effect of imputation by drawing subsets of the BD2009 training set for the well-characterized allele HLA-A*02:01. Predictors were trained on a range of simulated training set sizes and tested on the remaining data (see the figure below). We find that imputation gives a modest improvement up to approximately 100 training samples. With more training data there is no benefit to imputation.

MHCflurry performance on down-sampled training data for HLA-A*02:01 with and without imputation


We then compared the performance of MHCflurry against NetMHC, NetMHCpan, and SMM on the blind test data. The MHCflurry ensemble model contains 32 predictors initialized with different random weights. The MHCflurry ensemble is competitive with NetMHC and NetMHCpan.

Performance on BLIND dataset

Predictor                      AUC       \(F_1\) score   Kendall’s \(\tau\)
MHCflurry (ensemble)           0.93260   0.78459         0.58686
MHCflurry (single predictor)   0.93225   0.78106         0.58572
NetMHC                         0.93234   0.80722         0.58633
NetMHCpan                      0.93264   0.79957         0.58138
SMM-PMBEC                      0.92134   0.79026         0.56488



Imputing training data shows promise in cross-validation as a way to improve performance on alleles with few observations, but only seems to help for very small training sizes (\(\leq 100\)). Unfortunately, none of the alleles included in the BLIND dataset had fewer than 100 samples in BD2009, and only one had fewer than 200. Thus, additional work is required to assess the accuracy of MHCflurry and other predictors on alleles with scarce training data. Additionally, we need to further investigate the interaction between imputation parameters, the decay schedule for the weights of imputed samples, and stopping criteria for training individual allele-specific predictors.


MHCflurry is available at The data, scripts, and notebooks used to generate the plots and tables in this paper are available at


  1. M. K. Anderson, R. Pant, A. L. Miracle, X. Sun, C. A. Luer, C. J. Walsh, J. C. Telfer, G. W. Litman, E. V. Rothenberg. Evolutionary Origins of Lymphocytes: Ensembles of T Cell and B Cell Transcriptional Regulators in a Cartilaginous Fish. The Journal of Immunology 172, 5851–5860 The American Association of Immunologists, 2004.

  2. M Blackman, J Kappler, P Marrack. The role of the T cell receptor in positive and negative selection of developing T cells. Science 248, 1335–1341 American Association for the Advancement of Science (AAAS), 1990.

  3. Eric S. Huseby, Janice White, Frances Crawford, Tibor Vass, Dean Becker, Clemencia Pinilla, Philippa Marrack, John W. Kappler. How the T Cell Repertoire Becomes Peptide and MHC Specific. Cell 122, 247–260 Elsevier BV, 2005.

  4. Peter Cresswell, Anne L. Ackerman, Alessandra Giodini, David R. Peaper, Pamela A. Wearsch. Mechanisms of MHC class I-restricted antigen processing and cross-presentation. Immunol Rev 207, 145–157 Wiley-Blackwell, 2005.

  5. C. Lundegaard, O. Lund, C. Kesmir, S. Brunak, M. Nielsen. Modeling the adaptive immune system: predictions and simulations. Bioinformatics 23, 3265–3275 Oxford University Press (OUP), 2007.

  6. A. Sette, S. Buus, E. Appella, J. A. Smith, R. Chesnut, C. Miles, S. M. Colon, H. M. Grey. Prediction of major histocompatibility complex binding regions of protein antigens by sequence pattern analysis. Proceedings of the National Academy of Sciences 86, 3296–3300 Proceedings of the National Academy of Sciences, 1989.

  7. B. Peters, W. Tong, J. Sidney, A. Sette, Z. Weng. Examining the independent binding assumption for binding of peptide epitopes to MHC-I molecules. Bioinformatics 19, 1765–1772 Oxford University Press (OUP), 2003.

  8. C. Lundegaard, K. Lamberth, M. Harndahl, S. Buus, O. Lund, M. Nielsen. NetMHC-3.0: accurate web accessible predictions of human mouse and monkey MHC class I affinities for peptides of length 8-11. Nucleic Acids Research 36, W509–W512 Oxford University Press (OUP), 2008.

  9. Morten Nielsen, Claus Lundegaard, Thomas Blicher, Kasper Lamberth, Mikkel Harndahl, Sune Justesen, Gustav Røder, Bjoern Peters, Alessandro Sette, Ole Lund, Søren Buus. NetMHCpan a Method for Quantitative Predictions of Peptide Binding to Any HLA-A and -B Locus Protein of Known Sequence. PLoS ONE 2, e796 Public Library of Science (PLoS), 2007.

  10. Ole Lund, Eduardo J. M. Nascimento, Milton Maciel, Morten Nielsen, Mette Voldby Larsen, Claus Lundegaard, Mikkel Harndahl, Kasper Lamberth, Søren Buus, Jérôme Salmon, Thomas J. August, Ernesto T. A. Marques. Human Leukocyte Antigen (HLA) Class I Restricted Epitope Discovery in Yellow Fewer and Dengue Viruses: Importance of HLA Binding Strength. PLoS ONE 6, e26494 Public Library of Science (PLoS), 2011.

  11. Matthew M. Gubin, Maxim N. Artyomov, Elaine R. Mardis, Robert D. Schreiber. Tumor neoantigens: building a framework for personalized cancer immunotherapy. Journal of Clinical Investigation 125, 3413–3421 American Society for Clinical Investigation, 2015.

  12. J. R. F. Abreu, S. Martina, A. A. Verrijn Stuart, Y. E. Fillié, K. L. M. C. Franken, J. W. Drijfhout, B. O. Roep. CD8 T cell autoreactivity to preproinsulin epitopes with very low human leucocyte antigen class I binding affinity. Clinical & Experimental Immunology 170, 57–65 Wiley-Blackwell, 2012.

  13. David Gfeller, Michal Bassani-Sternberg, Julien Schmidt, Immanuel F. Luescher. Current tools for predicting cancer-specific T cell immunity. OncoImmunology 00–00 Informa UK Limited, 2016.

  14. Yohan Kim, John Sidney, Søren Buus, Alessandro Sette, Morten Nielsen, Bjoern Peters. Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions. BMC Bioinformatics 15, 241 Springer Science + Business Media, 2014.

  15. Nima Salimi, Ward Fleri, Bjoern Peters, Alessandro Sette. The immune epitope database: a historical retrospective of the first decade. Immunology 137, 117–123 Wiley-Blackwell, 2012.

  16. O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525 Oxford University Press (OUP), 2001.

  17. Rahul Mazumder, Trevor Hastie, Robert Tibshirani. Spectral Regularization Algorithms for Learning Large Incomplete Matrices. The Journal of Machine Learning Research 11, 2287–2322, 2010.

  18. Melissa J. Azur, Elizabeth A. Stuart, Constantine Frangakis, Philip J. Leaf. Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research 20, 40–49 Wiley-Blackwell, 2011.

  19. Sergey Feldman, Alex Rubinsteyn. fancyimpute: Version 0.0.16. (2016).

  20. François Chollet. keras. GitHub repository GitHub, 2015.

  21. Claus Lundegaard, Ole Lund, Morten Nielsen. Accurate approximation method for prediction of class I MHC affinities for peptides of length 8, 10 and 11 using prediction tools trained on 9mers. Bioinformatics 24, 1397–1398 Oxford Univ Press, 2008.
