Authorea

Timothy O'Donnell edited sectionContent_Text_.tex almost 8 years ago

Commit id: d68e5676a05e385838dd7d9b420ca83015ecbff4

\section{Introduction} In organisms with adaptive immunity, cells of all tissues expose on their surface small fragments of proteins extracted from the cytosol. These peptide fragments, typically but not always 9 amino acids in length, are monitored by T cells, which recognize and kill infected or cancerous cells after detecting viral, bacterial, mutant, or otherwise unusual peptides~\cite{Anderson_2004}. To be presented, the peptides must bind to a major histocompatibility complex I (MHC I) protein, which forms a platform for interaction with T cells. There are many thousands of MHC alleles in the human population, each with a distinct binding preference for peptides. Peptide-MHC affinity prediction is the well-studied problem of predicting the binding strength of a given pair of peptide and MHC I allele~\cite{Lundegaard_2007}.
% (original): In most vertebrates, cytotoxic T-cells enforce multi-cellular order by killing both infected and cancerous cells. Each individual organism possesses a poly-clonal army of T-cells which collectively are able to distinguish rare unhealthy cells from healthy ones. This amazing feat is achieved through the winnowing and expansion of clonal T-cell populations possessing highly specific T-cell receptors (TCRs)~\cite{Blackman_1990}. Each distinct TCR is able to recognize a small number of similar peptides bound to an MHC molecule on the surface of a cell~\cite{Huseby_2005}.
Though there are many steps in ``antigen processing''~\cite{Cresswell_2005} (the process by which protein fragments are loaded onto membrane-bound MHCs), it has become apparent that MHC binding is the most restrictive step and consequently the most important sub-problem in predicting T-cell epitopes. Initial approaches to predicting MHC ligands focused on ``sequence motifs''~\cite{Sette_1989}, which were quickly replaced by a variety of regularized linear models; these in turn are consistently outperformed by regularized linear models with interaction terms, such as SMM~\cite{Peters_2003}. The march toward black-box non-linear models reached its local maximum with the NetMHC family of predictors, a collection of related models that use ensembles of neural networks. Two of these predictors in particular, NetMHC~\cite{Lundegaard_2008} and NetMHCpan~\cite{Nielsen_2007}, have emerged as the preferred methods for computational prediction of MHC ligands across several areas of immunology, including virology~\cite{Lund_2011}, tumor immunology~\cite{Gubin_2015}, and autoimmunity~\cite{Abreu_2012}.
The primary difference between NetMHC and NetMHCpan is that the former is an {\it allele-specific} method, which trains a separate predictor on each allele's binding dataset, whereas the latter is a {\it pan-allele} method, whose inputs are vector encodings of both the peptide and a subset of the MHC molecule's primary sequence. The conventional wisdom is that NetMHC performs better on alleles with many assayed ligands, whereas NetMHCpan is superior for less well-characterized alleles~\cite{Gfeller_2016}. In this paper we explore the space between {\it allele-specific} and {\it pan-allele} prediction by imputing the peptide-MHC affinities for which we have no measurements and using these imputed values to pre-train allele-specific binding predictors.
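One way to make the imputation step concrete is the following sketch; the notation ($A$, $\Omega$, $\hat{A}$, $f$) is introduced here purely for illustration and is an assumption, not something fixed by the text above. Let $A$ be a $P \times M$ matrix whose entry $A_{pm}$ is the binding affinity of peptide $p$ for MHC allele $m$, and let $\Omega$ be the set of index pairs $(p, m)$ for which an assay measurement exists. Imputation then yields a completed matrix
\begin{equation}
\hat{A}_{pm} = \left\{ \begin{array}{ll} A_{pm} & \mbox{if } (p, m) \in \Omega, \\ f_{pm}(A_{\Omega}) & \mbox{otherwise,} \end{array} \right.
\end{equation}
where $f$ stands for whichever imputation method is chosen and $A_{\Omega}$ denotes the measured entries. Under this reading, the predictor for allele $m$ would be pre-trained on column $m$ of $\hat{A}$ and then trained further on the measured entries of that column alone.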