deletions | additions
diff --git a/abstract.tex b/abstract.tex
index 362e031..fb06084 100644
--- a/abstract.tex
+++ b/abstract.tex
...
Predicting the binding affinity between MHC proteins and their peptide ligands is a key problem in computational immunology. State-of-the-art performance is currently achieved by the allele-specific predictor NetMHC and the pan-allele predictor NetMHCpan, both of which are ensembles of shallow neural networks. We explore an intermediate between allele-specific and pan-allele prediction: training allele-specific predictors with synthetic samples generated by imputation of the peptide-MHC affinity matrix. We find that the imputation strategy is useful on alleles with very little training data. We have implemented our predictor as an open-source software package called MHCflurry and show that MHCflurry achieves performance competitive with NetMHC and NetMHCpan.
diff --git a/sectionContent_Text_.tex b/sectionContent_Text_.tex
index f4e16e9..a81875e 100644
--- a/sectionContent_Text_.tex
+++ b/sectionContent_Text_.tex
...
\section{Introduction}
%In organisms with adaptive immunity, cells expose fragments of proteins extracted from the cytosol. These peptide fragments, typically but not always 9 amino acids in length, are monitored by cytotoxic T cells, which recognize and kill infected or cancerous cells based on their viral, bacterial, mutant, or unusual peptides~\cite{Anderson_2004}. Presented peptides must bind to a major histocompatibility complex I (MHC I) protein, which forms a platform for interaction with T cells.
In most vertebrates, cytotoxic T-cells enforce multi-cellular order by killing both infected and cancerous cells. Each individual organism possesses a poly-clonal army of T-cells which collectively are able to distinguish unhealthy cells from healthy ones. This amazing feat is achieved through the winnowing and expansion of clonal T-cell populations possessing highly specific T-cell receptors (TCRs)~\cite{Blackman_1990}. Each distinct TCR recognizes a small number of similar peptides bound to an MHC molecule on the surface of a cell~\cite{Huseby_2005}. Though there are many steps in ``antigen
processing''~\cite{Cresswell_2005}, it has become apparent that MHC binding is the most restrictive step. Peptide-MHC affinity prediction is the well-studied problem of predicting the binding strength of a given peptide and MHC pair~\cite{Lundegaard_2007}. Early approaches focused on ``sequence motifs''~\cite{Sette_1989}, followed by regularized linear models, linear models with interaction terms such as SMM~\cite{Peters_2003}, and more recently the NetMHC family of predictors, a collection of related models based on ensembles of neural networks. Two of these predictors, NetMHC~\cite{Lundegaard_2008} and NetMHCpan~\cite{Nielsen_2007}, have emerged as the methods of choice across
multiple fields of study within immunology, including virology~\cite{Lund_2011}, tumor immunology~\cite{Gubin_2015}, and autoimmunity~\cite{Abreu_2012}.
% Initial approaches to predicting MHC ligands focused on ``sequence motifs''\cite{Sette_1989}, which were quickly replaced by a variety of regularized linear models, which themselves are consistently outperformed by regularized linear models with interaction terms such as SMM~\cite{Peters_2003}. The march toward black box non-linear models reached its local maximum with the NetMHC family of predictors, which are a collection of related models that utilize ensembles of neural networks. T
...
% (original): In most vertebrates, cytotoxic T-cells enforce multi-cellular order by killing both infected and cancerous cells. Each individual organism possesses a poly-clonal army of T-cells which collectively are able to distinguish rare unhealthy cells from healthy ones. This amazing feat is achieved through the winnowing and expansion of clonal T-cell populations possessing highly specific T-cell receptors (TCRs)~\cite{Blackman_1990}. Each distinct TCR is able to recognize a small number of similar peptides bound to an MHC molecule on the surface of a cell~\cite{Huseby_2005}. Though there are many steps in ``antigen processing''~\cite{Cresswell_2005} (the process by which protein fragments find themselves loaded onto membrane-bound MHCs), it has become apparent that MHC binding is the most restrictive step and consequently the most important sub-problem of predicting T-cell epitopes.
NetMHC is an {\it allele-specific} method which trains a separate predictor for each allele's binding dataset, whereas NetMHCpan is a {\it pan-allele} method whose inputs are vector encodings of both
a peptide and a subsequence of a particular MHC molecule. The conventional wisdom is that NetMHC performs better on alleles with many assayed ligands whereas NetMHCpan is superior for less well-characterized alleles~\cite{Gfeller_2016}.
In this paper we explore the space between {\it allele-specific} and {\it pan-allele} prediction by imputing the unobserved values of peptide-MHC affinities for which we have no measurements and using these imputed values for pre-training of allele-specific binding predictors.
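To make the idea of imputing unobserved peptide-MHC affinities concrete, the following is a minimal sketch under stated assumptions, not the method used in this paper. It assumes affinities are arranged in a peptide-by-allele matrix with missing entries marked as NaN, and it fills them by iterated truncated-SVD reconstruction, which is one common matrix-completion heuristic. The function name \texttt{impute\_low\_rank} and its parameters are hypothetical.

```python
import numpy as np

def impute_low_rank(X, rank=2, n_iters=100):
    """Fill NaN entries of X by iterated truncated-SVD reconstruction.

    Assumption: rows are peptides, columns are alleles, NaN marks an
    unmeasured affinity. This is an illustrative heuristic only.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initialize missing entries with the mean of all observed values.
    filled = np.where(missing, np.nanmean(X), X)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Rank-limited reconstruction of the current estimate.
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Observed entries are always kept; only missing ones are updated.
        filled = np.where(missing, low_rank, X)
    return filled
```

In a pre-training setup such as the one described above, the imputed entries for a given allele would serve as additional synthetic training samples alongside the measured ones.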
diff --git a/section_Results_We_evaluated_the__.tex b/section_Results_We_evaluated_the__.tex
index 9d04cc5..2976df6 100644
--- a/section_Results_We_evaluated_the__.tex
+++ b/section_Results_We_evaluated_the__.tex
...
% The figure doesn't show up in the preview but does show up if you export to pdf.
\begin{figure}[hb]
\includegraphics{figures/impute_comparison.pdf}
\caption{MHCflurry performance on down-sampled training data for HLA-A*02:01 (with and without imputation)}
\label{fig:imputecomparison}
\end{figure}
We compared the performance of two MHCflurry-based models, ``MHCflurry ensemble'' and ``MHCflurry single,'' against NetMHC, NetMHCpan, and SMM-PMBEC on the blind test data. The ``MHCflurry single'' model is one predictor with the architecture described previously. The MHCflurry ensemble model contains 10 predictors initialized with different random initial weights.
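The ensemble construction can be sketched as follows. This is an illustration under assumptions, not the MHCflurry implementation: the text does not state how member predictions are combined, so simple averaging is assumed here, and \texttt{make\_predictor} is a hypothetical untrained stand-in for one network whose only purpose is to show seed-dependent initialization.

```python
import numpy as np

def make_predictor(seed, n_features):
    """Hypothetical stand-in for one network: a random logistic scorer."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n_features)  # initial weights differ per seed
    return lambda X: 1.0 / (1.0 + np.exp(-(X @ w)))

def ensemble_predict(predictors, X):
    """Average the member predictions (assumed combination rule)."""
    return np.mean([p(X) for p in predictors], axis=0)

# Ten members that share an architecture but differ in random seed.
predictors = [make_predictor(seed, n_features=4) for seed in range(10)]
X = np.ones((3, 4))  # three dummy encoded peptides
scores = ensemble_predict(predictors, X)
```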
\begin{table}[hr]
\centering
...
\toprule
{} & AUC & $F_1$ score & Kendall's $\tau$ \\
\midrule
MHCflurry (ensemble) & \textbf{0.93260} & 0.78459 & \textbf{0.58686} \\
MHCflurry (single predictor) & 0.93225 & 0.78106 & 0.58572 \\
NetMHC & 0.93234 & \textbf{0.80722} & \textbf{0.58633} \\
NetMHCpan & \textbf{0.93264} & 0.79957 & 0.58138 \\
SMM-PMBEC & 0.92134 & 0.79026 & 0.56488 \\
\bottomrule
\end{tabular}
...
\label{tab:measurementweighted}
\end{table}
The MHCflurry ensemble predictor is competitive with NetMHC, but slightly worse than NetMHCpan. After running these experiments, we realized that the BD2009 / BLIND train and test datasets do not contain any alleles with fewer than 200 training observations. Since imputation only seems to help for alleles with fewer than 100 observations, performance on this benchmark may not benefit significantly from the imputation approach.
\section{Discussion}
Imputing training data shows promise in cross-validation as a way to improve performance on alleles with few observations, but only seems to help for very small training sizes ($\leq 200$). Unfortunately, only one allele in the BLIND dataset had fewer than 200 samples. Thus, additional work is required to assess the accuracy of MHCflurry and other predictors on alleles with very few training examples.
Additionally, we need to further investigate the interaction between imputation parameters, the schedule according to which the weights of imputed samples are decayed, and stopping criteria for training individual allele-specific predictors. Nonetheless, even in its preliminary state, MHCflurry
% These are generated in the 'paper plots' notebook; do not edit by hand.