Authorea

Alex Rubinsteyn edited sectionContent_Text_.tex almost 8 years ago

Commit id: a603442c20aa59c280acdd819d68280d7c8533d6

Initial approaches to predicting MHC ligands focused on ``sequence motifs''~\cite{Sette_1989}. These were quickly replaced by a variety of regularized linear models, which were in turn consistently outperformed by regularized linear models with interaction terms such as SMM~\cite{Peters_2003}. The inexorable march toward black box non-linear models reached its local maximum with the NetMHC family of predictors, a collection of related models that use ensembles of neural networks. Two of these predictors in particular, NetMHC~\cite{Lundegaard_2008} and NetMHCpan~\cite{Nielsen_2007}, have emerged as the preferred methods for computational prediction of MHC ligands across several areas of immunology, including virology~\cite{Lund_2011}, tumor immunology~\cite{Gubin_2015}, and autoimmunity~\cite{Abreu_2012}. The primary difference between NetMHC and NetMHCpan is that the former is an ``allele-specific'' method, which trains a separate predictor on each allele's binding dataset, whereas the latter is a ``pan-allele'' method whose inputs are vector encodings of both the peptide and a subset of the MHC molecule's primary sequence. The conventional wisdom is that NetMHC performs better on alleles with many assayed ligands, whereas NetMHCpan is superior for less well-characterized alleles~\cite{Gfeller_2016}. In this paper we explore the space between ``allele-specific'' and ``pan-allele'' prediction by imputing peptide-MHC affinities for which we have no measurements and using these imputed values to pre-train allele-specific binding predictors.