Predicting Peptide-MHC Binding Affinities With Imputed Training Data


Predicting the binding affinity between MHC proteins and their peptide ligands is a key problem in computational immunology. State-of-the-art performance is currently achieved by the allele-specific predictor NetMHC and the pan-allele predictor NetMHCpan, both of which are ensembles of shallow neural networks. We explore an intermediate between allele-specific and pan-allele prediction: training allele-specific predictors with synthetic samples generated by imputation of the peptide-MHC affinity matrix. We find that the imputation strategy is useful for alleles with very little training data. We have implemented our predictor as an open-source software package called MHCflurry and show that MHCflurry achieves performance competitive with NetMHC and NetMHCpan.


In most vertebrates, cytotoxic T-cells enforce multi-cellular order by killing infected or cancerous cells. Each organism possesses a polyclonal army of T-cells which collectively are able to distinguish unhealthy cells from healthy ones. This amazing feat is achieved through the winnowing and expansion of clonal T-cell populations possessing highly specific T-cell receptors (TCRs) (Blackman 1990). Each distinct TCR recognizes a small number of similar peptides bound to an MHC molecule on the surface of a cell (Huseby 2005). Though there are many steps in “antigen processing” (Cresswell 2005), it has become apparent that MHC binding is the most restrictive step. Peptide-MHC affinity prediction is the well-studied problem of predicting the binding strength between a given peptide and MHC molecule (Lundegaard 2007). Early approaches focused on “sequence motifs” (Sette 1989), followed by regularized linear models, linear models with interaction terms such as SMM with pairwise features (Peters 2003), and more recently the NetMHC family of predictors, a collection of related models based on ensembles of neural networks. Two of these predictors, NetMHC (Lundegaard 2008) and NetMHCpan (Nielsen 2007), have emerged as the methods of choice across multiple fields of study within immunology, including virology (Lund 2011), tumor immunology (Gubin 2015), and autoimmunity (Abreu 2012).

NetMHC is an allele-specific method which trains a separate predictor for each allele’s binding dataset, whereas NetMHCpan is a pan-allele method whose inputs are vector encodings of both a peptide and a subsequence of a particular MHC molecule. The conventional wisdom is that NetMHC performs better on alleles with many assayed ligands whereas NetMHCpan is superior for less well-characterized alleles (Gfeller 2016).
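The difference between the two kinds of input can be illustrated with a minimal sketch. It assumes a plain one-hot encoding over the 20 standard amino acids, which is chosen here only for simplicity and is not claimed to be the encoding used by NetMHC or NetMHCpan. The MHC subsequence is a made-up placeholder string, not a real pseudosequence, and the function name `one_hot` is hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Flattened one-hot encoding of an amino acid sequence (illustrative only)."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for position, residue in enumerate(sequence):
        encoded[position, AA_INDEX[residue]] = 1.0
    return encoded.ravel()


peptide = "SLYNTVATL"           # a 9-mer peptide
mhc_subsequence = "ACDEFGHIKL"  # made-up placeholder, NOT a real MHC subsequence

# Allele-specific input: the peptide alone; the allele is implied by which
# predictor the vector is fed to.
allele_specific_input = one_hot(peptide)

# Pan-allele input: the peptide encoding concatenated with an encoding of
# part of the MHC sequence, so a single predictor can serve many alleles.
pan_allele_input = np.concatenate([one_hot(peptide), one_hot(mhc_subsequence)])

print(allele_specific_input.shape, pan_allele_input.shape)
```

In the allele-specific case the allele identity never enters the input vector, which is why a separate predictor is needed for each allele.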

In this paper, we explore the space between allele-specific and pan-allele prediction by imputing peptide-MHC affinities for which we have no measurements and using these imputed values to pre-train allele-specific binding predictors.
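To make the idea of imputing the peptide-MHC affinity matrix concrete, the following is a minimal sketch using iterative low-rank SVD completion. This technique is an assumption chosen for brevity and is not necessarily the imputation method used in this work. The toy matrix, its values (hypothetical affinities rescaled to [0, 1]), and the function name `impute_low_rank` are all illustrative.

```python
import numpy as np


def impute_low_rank(matrix, rank=1, n_iters=50):
    """Fill NaN entries by repeatedly projecting onto a rank-`rank` approximation.

    Observed entries are restored after every iteration; only the missing
    entries take values from the low-rank reconstruction.
    """
    observed = ~np.isnan(matrix)
    # Initialise missing entries with the per-allele (column) mean.
    column_means = np.nanmean(matrix, axis=0)
    filled = np.where(observed, matrix, column_means)
    for _ in range(n_iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
        filled = np.where(observed, matrix, low_rank)
    return filled


# Rows are peptides, columns are alleles; NaN marks an unmeasured pair.
affinities = np.array([
    [0.8, 0.7, np.nan],
    [0.2, np.nan, 0.1],
    [np.nan, 0.5, 0.4],
    [0.6, 0.5, 0.4],
])

completed = impute_low_rank(affinities)
observed_mask = ~np.isnan(affinities)

# The imputed entries of one column could serve as synthetic samples for
# pre-training that allele's predictor, ahead of training on measured values.
synthetic_for_last_allele = completed[~observed_mask[:, 2], 2]
print(completed.round(2))
```

Under this sketch, an allele with few measurements borrows information from better-characterized alleles through the shared low-rank structure, while each allele still gets its own predictor.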