
Alex Rubinsteyn Merge branch 'master' of github.com:hammerlab/mhcflurry-icml-compbio-2016 almost 8 years ago

Commit id: d24c72094b510b4e322e85d469144c1aa3eb4e79


Predicting the binding affinity between MHC proteins and their peptide ligands is a key problem in computational immunology. State-of-the-art performance is currently achieved by the allele-specific predictor NetMHC and the pan-allele predictor NetMHCpan, both of which are ensembles of shallow neural networks. We explore an intermediate between allele-specific and pan-allele prediction: training allele-specific predictors with synthetic samples generated by imputation of the peptide-MHC affinity matrix. We find that the imputation strategy is useful on alleles with very little training data. We have implemented our predictor as an open-source software package called MHCflurry and show that MHCflurry achieves performance competitive with NetMHC and NetMHCpan.

\section{Introduction}

%In organisms with adaptive immunity, cells expose fragments of proteins extracted from the cytosol. These peptide fragments, typically but not always 9 amino acids in length, are monitored by cytotoxic T cells, which recognize and kill infected or cancerous cells based on their viral, bacterial, mutant, or unusual peptides~\cite{Anderson_2004}. Presented peptides must bind to a major histocompatibility complex I (MHC I) protein, which forms a platform for interaction with T cells.

In most vertebrates, cytotoxic T-cells enforce multi-cellular order by killing both infected and cancerous cells. Each individual organism possesses a poly-clonal army of T-cells which collectively are able to distinguish unhealthy cells from healthy ones. This amazing feat is achieved through the winnowing and expansion of clonal T-cell populations possessing highly specific T-cell receptors (TCRs)~\cite{Blackman_1990}. Each distinct TCR recognizes a small number of similar peptides bound to an MHC molecule on the surface of a cell~\cite{Huseby_2005}. Though there are many steps in ``antigen processing''~\cite{Cresswell_2005}, it has become apparent that MHC binding is the most restrictive step. Peptide-MHC affinity prediction is the well-studied problem of predicting the binding strength of a given peptide and MHC pair~\cite{Lundegaard_2007}. Early approaches focused on ``sequence motifs''~\cite{Sette_1989}, followed by regularized linear models, linear models with interaction terms such as SMM~\cite{Peters_2003}, and more recently the NetMHC family of predictors, a collection of related models based on ensembles of neural networks.
Two of these predictors, NetMHC~\cite{Lundegaard_2008} and NetMHCpan~\cite{Nielsen_2007}, have emerged as the methods of choice across multiple fields of study within immunology, including virology~\cite{Lund_2011}, tumor immunology~\cite{Gubin_2015}, and autoimmunity~\cite{Abreu_2012}.
% Initial approaches to predicting MHC ligands focused on ``sequence motifs''\cite{Sette_1989}, which were quickly replaced by a variety of regularized linear models, which themselves are consistently outperformed by regularized linear models with interaction terms such as SMM~\cite{Peters_2003}. The march toward black box non-linear models reached its local maximum with the NetMHC family of predictors, which are a collection of related models that utilize ensembles of neural networks. T

% (original): In most vertebrates, cytotoxic T-cells enforce multi-cellular order by killing both infected and cancerous cells. Each individual organism possesses a poly-clonal army of T-cells which collectively are able to distinguish rare unhealthy cells from healthy ones. This amazing feat is achieved through the winnowing and expansion of clonal T-cell populations possessing highly specific T-cell receptors (TCRs)~\cite{Blackman_1990}. Each distinct TCR is able to recognize a small number of similar peptides bound to an MHC molecule on the surface of a cell~\cite{Huseby_2005}. Though there are many steps in ``antigen processing''~\cite{Cresswell_2005} (the process by which protein fragments find themselves loaded onto membrane-bound MHCs), it has become apparent that MHC binding is the most restrictive step and consequently the most important sub-problem of predicting T-cell epitopes.
NetMHC is an {\it allele-specific} method that trains a separate predictor for each allele's binding dataset, whereas NetMHCpan is a {\it pan-allele} method whose inputs are vector encodings of both a peptide and a subsequence of a particular MHC molecule. The conventional wisdom is that NetMHC performs better on alleles with many assayed ligands whereas NetMHCpan is superior for less well-characterized alleles~\cite{Gfeller_2016}. In this paper we explore the space between {\it allele-specific} and {\it pan-allele} prediction by imputing peptide-MHC affinities for which we have no measurements and using these imputed values to pre-train allele-specific binding predictors.
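
To make the notion of imputing unobserved affinities concrete, the following Python sketch fills in a missing peptide-MHC affinity for one allele using measurements of the same peptide on other alleles, weighted by how closely those alleles agree with the target on shared peptides. This is only an illustration under stated assumptions: the similarity measure, the function names, the allele labels, and the toy values are all hypothetical, and this is not claimed to be the imputation procedure used by MHCflurry.

```python
from typing import Dict, Optional

# affinities[allele][peptide] = affinity on an arbitrary 0..1 scale.
# All names and values below are made up for illustration only.
Affinities = Dict[str, Dict[str, float]]


def allele_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Similarity in (0, 1], from mean absolute difference on shared peptides.

    Returns 0.0 when the two alleles share no measured peptide.
    """
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_abs_diff = sum(abs(a[p] - b[p]) for p in shared) / len(shared)
    return 1.0 / (1.0 + mean_abs_diff)


def impute(affinities: Affinities, allele: str, peptide: str) -> Optional[float]:
    """Similarity-weighted mean of other alleles' measurements of `peptide`."""
    target = affinities[allele]
    weighted_sum = 0.0
    total_weight = 0.0
    for other, values in affinities.items():
        if other == allele or peptide not in values:
            continue
        weight = allele_similarity(target, values)
        weighted_sum += weight * values[peptide]
        total_weight += weight
    if total_weight == 0.0:
        return None  # no informative allele measured this peptide
    return weighted_sum / total_weight


def imputed_training_set(affinities: Affinities, allele: str) -> Dict[str, float]:
    """Synthetic (peptide, affinity) samples for peptides unmeasured on `allele`."""
    all_peptides = {p for values in affinities.values() for p in values}
    synthetic = {}
    for peptide in sorted(all_peptides - set(affinities[allele])):
        value = impute(affinities, allele, peptide)
        if value is not None:
            synthetic[peptide] = value
    return synthetic


# Hypothetical toy matrix: allele_A lacks a measurement for GILGFVFTL.
toy: Affinities = {
    "allele_A": {"SIINFEKL": 0.8, "AAAWYLWEV": 0.2},
    "allele_B": {"SIINFEKL": 0.8, "AAAWYLWEV": 0.2, "GILGFVFTL": 0.9},
    "allele_C": {"SIINFEKL": 0.1, "GILGFVFTL": 0.3},
}

if __name__ == "__main__":
    print(imputed_training_set(toy, "allele_A"))
```

In this toy matrix, allele\_B agrees with allele\_A on every shared peptide and therefore dominates the imputed value, while the dissimilar allele\_C contributes less. Synthetic samples of this kind could then be mixed with the real measurements when pre-training an allele-specific predictor.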

% The figure doesn't show up in the preview but does show up if you export to pdf.
\begin{figure}[hb]
\includegraphics{figures/impute_comparison.pdf}
\caption{MHCflurry performance on down-sampled training data for HLA-A*02:01 (with and without imputation)}
\label{fig:imputecomparison}
\end{figure}

We compared the performance of two MHCflurry-based models, ``MHCflurry ensemble'' and ``MHCflurry single,'' against NetMHC, NetMHCpan, and SMM-PMBEC on the blind test data. The ``MHCflurry single'' model is one predictor with the architecture described previously. The ``MHCflurry ensemble'' model contains 10 predictors, each identical to the single predictor but initialized with different random initial weights.

\begin{table}[h]
\centering

\begin{tabular}{lrrr}
\toprule
{} & AUC & $F_1$ score & Kendall's $\tau$ \\
\midrule
MHCflurry (ensemble) & \textbf{0.93260} & 0.78459 & \textbf{0.58686} \\
MHCflurry (single predictor) & 0.93225 & 0.78106 & 0.58572 \\
NetMHC & 0.93234 & \textbf{0.80722} & \textbf{0.58633} \\
NetMHCpan & \textbf{0.93264} & 0.79957 & 0.58138 \\
SMM-PMBEC & 0.92134 & 0.79026 & 0.56488 \\
\bottomrule
\end{tabular}

\label{tab:measurementweighted}
\end{table}

The MHCflurry ensemble predictor is competitive with NetMHC, but slightly worse than NetMHCpan. After running these experiments, we realized that the BD2009 / BLIND train and test datasets do not contain any alleles with fewer than 200 training observations. Since imputation only seems to help for alleles with fewer than 100 observations, results on this benchmark may not benefit significantly from the imputation approach.

\section{Discussion}

Imputing training data shows promise in cross-validation as a way to improve performance on alleles with few observations, but only seems to help for very small training sizes ($\leq 200$). Unfortunately, only one allele in the BLIND benchmark dataset had fewer than 200 samples. Thus, additional work is required to assess the accuracy of MHCflurry and other predictors on alleles with very few training examples. Additionally, we need to further investigate the interaction between imputation parameters, the schedule according to which the weights of imputed samples are decayed, and stopping criteria for training individual allele-specific predictors. Nonetheless, even in its preliminary state, MHCflurry
% These are generated in the 'paper plots' notebook; do not edit by hand.