Predicting the binding affinity between MHC proteins and their peptide ligands is a key problem in computational immunology. State-of-the-art performance is currently achieved by the allele-specific predictor NetMHC and the pan-allele predictor NetMHCpan, both of which are ensembles of shallow neural networks. We explore an intermediate between allele-specific and pan-allele prediction: training allele-specific predictors with synthetic samples generated by imputation of the peptide-MHC affinity matrix. We find that the imputation strategy is useful on alleles with very little training data. We have implemented our predictor as an open-source software package called MHCflurry and show that MHCflurry achieves performance competitive with NetMHC and NetMHCpan.
Predicting the binding affinity between peptides (short amino acid sequences) and MHC proteins has emerged as a central problem in computational immunology due to its importance in determining the targets of T-cell mediated immune activity. An individual’s polyclonal collection of T-cells is able to kill infected and cancerous cells while protecting healthy ones. This feat is achieved through the winnowing and expansion of T-cell sub-populations possessing highly specific T-cell receptors (TCRs). A distinct T-cell receptor recognizes a small number of similar peptides bound to an MHC molecule on the surface of a cell. Peptide-MHC binding is one of the most restrictive steps in “antigen processing” and is thus essential for determining which amino acid sequences can potentially trigger T-cell responses.

Early approaches to peptide-MHC binding prediction focused on “sequence motifs”, followed by regularized linear models and linear models with interaction terms, such as SMM with pairwise features. More recently, methods based on ensembles of shallow neural networks have become common tools in computational virology, tumor immunology, and autoimmunity. Existing predictors work by encoding amino acid sequences as fixed-length vectors using predefined amino acid features.

In this poster we delineate several flavors of the peptide-MHC binding problem (i.e. allele-specific vs. pan-allele) and present the following improvements over the current generation of peptide-MHC binding predictors:

- Learning vector embeddings for amino acids as part of training instead of using predefined features.
- Generating synthetic data using imputation to train models for alleles with few training samples.
- Replacing fixed-length vector encodings with recurrent neural networks to make better predictions across a broader range of sequence lengths.
INTRODUCTION

Dataset

Two datasets were used from a recent paper studying the relationship between training data and pMHC predictor accuracy. The training dataset (BD2009) contained entries from IEDB up to 2009, and the test dataset (BLIND) contained IEDB entries from between 2010 and 2013 which did not overlap with BD2009 (see the table below).

  ---------- --------- ------------------- ----------------
             Alleles   IC50 Measurements   Expanded 9mers
  BD2009     106       137,654             470,170
  BLIND      53        27,680              83,752
  ---------- --------- ------------------- ----------------

  : Train (BD2009) and test (BLIND) dataset sizes.

Throughout this paper we evaluate using AUC (area under the ROC curve), which estimates the probability that a “strong binder” peptide (affinity ≤ 500 nM) will be given a stronger predicted affinity than one whose ground-truth affinity is > 500 nM.

State-of-the-art

The state-of-the-art model for NetMHC is a shallow neural network containing (1) an embedding layer which transforms amino acids to learned vector representations, (2) a single hidden layer with tanh nonlinearity, and (3) a sigmoidal scalar output.

Performance: AUC =

Challenge of varying length peptide encoding

Generally there are two ways to encode a peptide: 1) one-hot encoding, or 2) index encoding followed by an embedding. Most models, such as shallow neural networks, rely on a fixed-length peptide encoding. Encoding peptides of varying length as fixed-length sequences is delicate, as one might lose important information from the dataset. Padding the encoded peptide with zeros up to the maximal peptide length in the dataset does not work well, because the final residues of each peptide then land at varying positions in the input. For shallow neural network models, the so-called “kmer index encoding” was devised: it uses fixed-length 9mer inputs, which requires peptides of other lengths to be transformed into multiple 9mers.
Shorter peptides are mapped onto 9mer query peptides by introducing a sentinel “X” at every possible position, whereas longer peptides are cut into multiple 9mers by removing consecutive stretches of residues. The predicted affinity for a non-9mer peptide is the geometric mean of the predictions for the generated 9mers. When n training samples derive from a single non-9mer sequence, their weights are adjusted to 1/n. This is referred to as kmer index encoding.

Label encoding

We map IC50 concentrations onto regression targets between 0.0 and 1.0 using the same scheme as NetMHC, y = 1.0 − min(1.0, log₅₀₀₀₀(IC50)).

LSTM MODELS

Motivation

LSTMs are a natural candidate for this problem, as they can handle sequences of varying length without relying on an encoding scheme such as kmer index encoding. One might argue that kmer index encoding, despite giving reasonable performance, does not use the full power of the dataset.

Hyperparameter tuning

The learning rate is the most important hyperparameter to tune for LSTMs, followed by the hidden size (reference). The learning rate can be tuned independently of the other parameters, so to save computing time one can tune it on very small models and extrapolate to larger ones. We found that for optimizer = 'adam', the learning rate should decay as 1/(1 + N)² from an initial value of 0.01, where N is the number of epochs.

Final LSTM model

We chose a bidirectional LSTM with a hidden size of 50, preceded by an embedding layer with output dimension 32. The outputs of the two LSTMs are merged by concatenation (mode = concat) and followed by a dense sigmoid layer. The data is encoded via simple index encoding, and the labels are mapped onto the interval [0, 1] as explained in the previous section.

LSTM vs FFNN

Despite being able to handle sequences of varying length, LSTMs perform worse than FFNN.
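To make the kmer index encoding and the label encoding described earlier concrete, here is a minimal Python sketch. It is not the MHCflurry implementation: the function names are hypothetical, and the handling of peptides shorter than 8 residues (placing several “X” sentinels at every combination of positions) is an assumption, since the text only describes inserting a sentinel at every possible position. Duplicate 9mers are not merged, and the geometric mean assumes strictly positive predictions (for example, IC50 values in nM).

```python
import math
from itertools import combinations

KMER = 9           # fixed input length expected by the feed-forward model
MAX_IC50 = 50000   # base of the logarithm in the label transform


def expand_to_9mers(peptide):
    """Return the list of 9mers generated from a single peptide."""
    n = len(peptide)
    if n == KMER:
        return [peptide]
    if n < KMER:
        # Assumption: choose which output positions hold the sentinel "X";
        # the peptide residues fill the remaining positions in order.
        expanded = []
        for x_positions in combinations(range(KMER), KMER - n):
            residues = iter(peptide)
            expanded.append("".join(
                "X" if i in x_positions else next(residues)
                for i in range(KMER)))
        return expanded
    # Longer peptide: remove one consecutive stretch of n - 9 residues,
    # starting at every possible offset.
    gap = n - KMER
    return [peptide[:i] + peptide[i + gap:] for i in range(KMER + 1)]


def weighted_training_samples(peptide):
    """Pair each generated 9mer with a sample weight of 1/n."""
    kmers = expand_to_9mers(peptide)
    return [(kmer, 1.0 / len(kmers)) for kmer in kmers]


def combine_predictions(predictions):
    """Geometric mean of the (strictly positive) per-9mer predictions."""
    log_sum = sum(math.log(p) for p in predictions)
    return math.exp(log_sum / len(predictions))


def ic50_to_target(ic50):
    """Regression target y = 1 - min(1, log_50000(IC50))."""
    return 1.0 - min(1.0, math.log(ic50, MAX_IC50))
```

With this sketch an 8mer yields nine 9mers (one per sentinel position) and a 10mer yields ten (one per removed residue). An IC50 of 50000 nM or weaker maps to a target of 0.0, and the 500 nM binder threshold maps to a target of roughly 0.43.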
(image: linear vs sigmoid vs FFNN vs LSTM)

Segmenting the test set into 9mers and non-9mers, one notices that the LSTM's worse overall performance stems from its worse performance on non-9mers, while FFNN and LSTM perform about the same on 9mers.

(2 images: linear vs sigmoid vs FFNN vs LSTM)

Quite surprisingly, a simple sigmoid layer does a better job on non-9mers than the LSTM. This means either that kmer index encoding works particularly well or that the LSTM does particularly badly. (If we are able to get the LSTM to mimic the kmer index encoding, we have a good chance of improving performance on non-9mers.)

HINTS FOR IMPROVEMENT

- A kmer-index-encoded LSTM does a better job (image)
- The LSTM performs worse on longer sequences (image)
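The length-segmented comparison above can be reproduced with a small evaluation helper. The sketch below is an illustration rather than the evaluation code used for the poster: the function names are hypothetical, and AUC is computed by direct pairwise comparison (quadratic in the number of samples) instead of a library routine. It follows the definition given earlier: the probability that a peptide with measured affinity ≤ 500 nM receives a stronger (lower) predicted IC50 than one with measured affinity > 500 nM.

```python
def auc(true_ic50, pred_ic50, threshold=500.0):
    """Pairwise AUC: fraction of (binder, non-binder) pairs ranked correctly.

    Returns None when either class is empty, since AUC is then undefined.
    """
    binders = [p for t, p in zip(true_ic50, pred_ic50) if t <= threshold]
    non_binders = [p for t, p in zip(true_ic50, pred_ic50) if t > threshold]
    if not binders or not non_binders:
        return None
    wins = 0.0
    for b in binders:
        for nb in non_binders:
            if b < nb:        # lower predicted IC50 means stronger affinity
                wins += 1.0
            elif b == nb:     # ties count as half a correct ranking
                wins += 0.5
    return wins / (len(binders) * len(non_binders))


def auc_by_length(peptides, true_ic50, pred_ic50):
    """Compute AUC separately for 9mers and for all other lengths."""
    groups = {"9mer": ([], []), "non-9mer": ([], [])}
    for peptide, t, p in zip(peptides, true_ic50, pred_ic50):
        key = "9mer" if len(peptide) == 9 else "non-9mer"
        groups[key][0].append(t)
        groups[key][1].append(p)
    return {key: auc(t, p) for key, (t, p) in groups.items()}
```

Calling `auc_by_length(peptides, measured, predicted)` on the test set for each model (linear, sigmoid, FFNN, LSTM) gives the two per-length scores compared above.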