Predicting Peptide-MHC Binding Affinities With Imputed Training Data


Predicting the binding affinity between MHC proteins and their peptide ligands is a key problem in computational immunology. State-of-the-art performance is currently achieved by the allele-specific predictor NetMHC and the pan-allele predictor NetMHCpan, both of which are ensembles of shallow neural networks. We explore an intermediate between allele-specific and pan-allele prediction: training allele-specific predictors with synthetic samples generated by imputation of the peptide-MHC affinity matrix. We find that the imputation strategy is useful for alleles with very little training data. We have implemented our predictor as an open-source software package called MHCflurry and show that MHCflurry achieves performance competitive with NetMHC and NetMHCpan.


In most vertebrates, cytotoxic T-cells enforce multicellular order by killing infected or cancerous cells. Each organism possesses a polyclonal army of T-cells which collectively are able to distinguish unhealthy cells from healthy ones. This feat is achieved through the winnowing and expansion of clonal T-cell populations possessing highly specific T-cell receptors (TCRs) (Blackman 1990). Each distinct TCR recognizes a small number of similar peptides bound to an MHC molecule on the surface of a cell (Huseby 2005). Though there are many steps in “antigen processing” (Cresswell 2005), it has become apparent that MHC binding is the most restrictive step. Peptide-MHC affinity prediction is the well-studied problem of predicting the binding strength of a given peptide and MHC pair (Lundegaard 2007). Early approaches focused on “sequence motifs” (Sette 1989), followed by regularized linear models, linear models with interaction terms such as SMM with pairwise features (Peters 2003), and more recently the NetMHC family of predictors, a collection of related models based on ensembles of neural networks. Two of these predictors, NetMHC (Lundegaard 2008) and NetMHCpan (Nielsen 2007), have emerged as the methods of choice across multiple fields of study within immunology, including virology (Lund 2011), tumor immunology (Gubin 2015), and autoimmunity (Abreu 2012).

NetMHC is an allele-specific method which trains a separate predictor for each allele’s binding dataset, whereas NetMHCpan is a pan-allele method whose inputs are vector encodings of both a peptide and a subsequence of a particular MHC molecule. The conventional wisdom is that NetMHC performs better on alleles with many assayed ligands whereas NetMHCpan is superior for less well-characterized alleles (Gfeller 2016).

In this paper we explore the space between allele-specific and pan-allele prediction by imputing the peptide-MHC affinities for which we have no measurements and using these imputed values to pre-train allele-specific binding predictors.

Data and evaluation metrics

Two datasets were taken from a recent paper studying the relationship between training data and pMHC predictor accuracy (Kim 2014). The training dataset (BD2009) contained entries from IEDB (Salimi 2012) up to 2009 and the test dataset (BLIND) contained IEDB entries from between 2010 and 2013 which did not overlap with BD2009 (Table \ref{tab:datasets}).

Train (BD2009) and test (BLIND) dataset sizes.

Dataset   Alleles   IC50 Measurements   Expanded 9mers
BD2009    106       137,654             470,170
BLIND     53        27,680              83,752


Throughout this paper we will evaluate a pMHC binding predictor using three different metrics:

  • AUC: Area under the ROC curve. Estimates the probability that a “strong binder” peptide (affinity \(\leq 500\)nM) will be given a stronger predicted affinity than one whose ground truth affinity is \(>500\)nM.

  • F\(_1\) score: Harmonic mean of precision and recall (sensitivity) for predicting “strong binders” with affinities \(\leq 500\)nM.

  • Kendall’s \(\tau\): Rank correlation across the full spectrum of binding affinities.
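The three metrics can be made concrete with a minimal sketch in plain Python. The function names, the toy affinities, and the choice of the simplest (tau-a) variant of Kendall's \(\tau\), in which tied pairs contribute zero, are illustrative assumptions; this is not the evaluation code behind the results reported in this paper.

```python
from itertools import combinations

BINDER_THRESHOLD_NM = 500.0  # "strong binder" cutoff used in the text


def auc(true_ic50, pred_ic50, threshold=BINDER_THRESHOLD_NM):
    """Fraction of (binder, non-binder) pairs in which the binder receives
    the lower (stronger) predicted IC50; prediction ties count as one half."""
    binders = [p for t, p in zip(true_ic50, pred_ic50) if t <= threshold]
    others = [p for t, p in zip(true_ic50, pred_ic50) if t > threshold]
    if not binders or not others:
        raise ValueError("need at least one binder and one non-binder")
    wins = sum(1.0 if b < o else 0.5 if b == o else 0.0
               for b in binders for o in others)
    return wins / (len(binders) * len(others))


def f1_score(true_ic50, pred_ic50, threshold=BINDER_THRESHOLD_NM):
    """Harmonic mean of precision and recall for the binder class."""
    tp = fp = fn = 0
    for t, p in zip(true_ic50, pred_ic50):
        actual, predicted = t <= threshold, p <= threshold
        tp += actual and predicted
        fp += (not actual) and predicted
        fn += actual and (not predicted)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def kendall_tau(true_ic50, pred_ic50):
    """Tau-a: (concordant pairs - discordant pairs) / total pairs."""
    pairs = list(combinations(zip(true_ic50, pred_ic50), 2))
    score = 0
    for (t1, p1), (t2, p2) in pairs:
        s = (t1 - t2) * (p1 - p2)
        score += (s > 0) - (s < 0)  # +1 concordant, -1 discordant, 0 tied
    return score / len(pairs)


# Hypothetical measured and predicted IC50 values in nM.
true_nm = [50.0, 200.0, 800.0, 5000.0]
pred_nm = [100.0, 600.0, 400.0, 9000.0]
print(auc(true_nm, pred_nm), f1_score(true_nm, pred_nm), kendall_tau(true_nm, pred_nm))
```

On these toy values one binder is predicted above the cutoff and one non-binder below it, so AUC is 0.75 and F\(_1\) is 0.5. AUC and F\(_1\) depend on the 500nM cutoff, whereas Kendall's \(\tau\) uses only the ordering of the affinities.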

Comparison of imputation algorithms as predictors

A dataset of peptide-MHC affinities for \(n\) peptides and \(a\) alleles may be thought of as an \(n \times a\) matrix where peptide/allele pairs without measurements are missing values. The task of predicting values at these positions is known as matrix completion or imputation (depending on the community and data source). We investigated the performance of several imputation algorithms as a standalone solution to the peptide-MHC affinity prediction problem. The algorithms considered were:

  • meanFill: Replace each missing pMHC binding affinity with the mean affinity for that allele. This is a very simple imputation method which serves as a baseline against which other methods can be compared.

  • knnImpute (Troyanskaya 2001): Each missing entry \(X_{ij}\) is imputed using the values in the \(k\) closest columns with an observation in row \(i\). Similarity between alleles is computed as \(e^{-d_{st}^2}\), where \(d_{st}\) is the mean squared difference between observed entries of alleles \(s\) and \(t\).

  • svdImpute (Troyanskaya 2001): Imputation using iterative fixed-rank SVD.

  • softImpute (Mazumder 2010): A singular value thresholding method which iteratively estimates a low-rank matrix completion without forcing the pre-specification of a particular solution rank. Instead, the softImpute method is parameterized by a shrinkage value \(\lambda\) that is subtracted from each singular value.

  • MICE (Azur 2011): Average multiple imputations generated using Gibbs sampling from the joint distribution of columns.
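The two simplest methods in the list above can be sketched on a toy peptide-by-allele matrix. This is a minimal illustration in plain Python, not the fancyimpute implementation: the function names, the use of `None` for missing entries, the toy values, and the similarity-weighted average in `knn_impute_entry` are all assumptions made for the example.

```python
import math


def mean_fill(matrix):
    """meanFill: replace each missing entry with the mean of its column (allele)."""
    filled = [list(row) for row in matrix]  # copy so the input is not modified
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] is not None]
        if not observed:
            continue  # allele with no measurements: nothing to average
        col_mean = sum(observed) / len(observed)
        for row in filled:
            if row[j] is None:
                row[j] = col_mean
    return filled


def allele_similarity(matrix, s, t):
    """exp(-d_st^2), with d_st the mean squared difference over peptides
    that have observations for both allele s and allele t."""
    diffs = [(row[s] - row[t]) ** 2 for row in matrix
             if row[s] is not None and row[t] is not None]
    if not diffs:
        return 0.0  # no shared peptides, so treat the alleles as unrelated
    d_st = sum(diffs) / len(diffs)
    return math.exp(-d_st ** 2)


def knn_impute_entry(matrix, i, j, k):
    """Impute X[i][j] from the k most similar alleles observed in row i,
    using a similarity-weighted average (an assumed weighting scheme)."""
    candidates = [(allele_similarity(matrix, j, t), matrix[i][t])
                  for t in range(len(matrix[0]))
                  if t != j and matrix[i][t] is not None]
    nearest = sorted(candidates, reverse=True)[:k]
    total = sum(w for w, _ in nearest)
    if total == 0:
        return None  # no usable neighbours
    return sum(w * v for w, v in nearest) / total


# Rows are peptides, columns are alleles; values are hypothetical affinities.
X = [[0.8, 0.7, None],
     [0.2, None, 0.3],
     [None, 0.5, 0.9],
     [0.6, 0.5, 0.1]]
print(mean_fill(X)[2][0])
print(knn_impute_entry(X, 2, 0, k=1))
```

In the toy matrix, allele 0 agrees more closely with allele 1 than with allele 2 on their shared peptides, so with \(k = 1\) the missing entry for peptide 2 is taken from allele 1, whereas meanFill ignores the other alleles entirely.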

We evaluated the performance of these methods using three-fold cross validation on BD2009, only considering peptides which occurred in at least three alleles and excluding alleles with fewer than five measurements (Table \ref{tab:imputation}). All imputation methods were implemented in the fancyimpute Python library (Feldman 2016). Since MICE outperformed the other methods on two of the three predictor metrics, we selected it for the subsequent neural network experiments.
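The filtering and fold construction described above can be sketched as follows. The text does not specify the order in which the two filters are applied or how measurements are assigned to folds, so this sketch assumes that sparse alleles are removed first and that individual measurements are shuffled into folds at random; the function names and the dictionary representation are likewise hypothetical.

```python
import random
from collections import Counter


def filter_measurements(measurements, min_alleles_per_peptide=3,
                        min_measurements_per_allele=5):
    """measurements maps (peptide, allele) -> affinity.
    Drop sparse alleles first, then peptides seen in too few remaining alleles."""
    per_allele = Counter(allele for _, allele in measurements)
    kept = {key: value for key, value in measurements.items()
            if per_allele[key[1]] >= min_measurements_per_allele}
    per_peptide = Counter(peptide for peptide, _ in kept)
    return {key: value for key, value in kept.items()
            if per_peptide[key[0]] >= min_alleles_per_peptide}


def cross_validation_splits(measurements, n_folds=3, seed=0):
    """Yield (train, test) dictionaries; each measurement is held out exactly once.
    The held-out entries are the ones an imputation method must reconstruct."""
    keys = sorted(measurements)
    random.Random(seed).shuffle(keys)
    folds = [keys[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        train = {k: v for k, v in measurements.items() if k not in held}
        test = {k: measurements[k] for k in held_out}
        yield train, test
```

Each training dictionary would then be converted to a peptide-by-allele matrix with the held-out entries marked as missing, imputed, and scored against the held-out values.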

Cross-validation performance of imputation algorithms on BD2009 dataset