\documentclass[10pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\usepackage{natbib}
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[greek,english]{babel}
% Edit this header.tex file to include frontmatter definitions and global macros
% Add here any LaTeX packages you would like to load in all document blocks
% \usepackage{xspace}
\usepackage{amsmath}
\usepackage{textgreek}
% Add here any LaTeX macros you would like to load in all document blocks
% \def\example{This is an example macro.}
% -----
\iflatexml
% Add here any LaTeXML-specific commands
\newcommand{\beginsupplement}{%
\setcounter{table}{0}
\renewcommand{\thetable}{S\arabic{table}}%
\setcounter{figure}{0}
\renewcommand{\thefigure}{S\arabic{figure}}%
}
% -----
\else
% Add here any export style-specific LaTeX commands. These will only be loaded upon document export.
% \paperfield{Subject domain of my document}
% \keywords{keyword1, keyword2}
% \corraddress{Author One PhD, Department, Institution, City, State or Province, Postal Code, Country}
% \fundinginfo{Funder One, Funder One Department, Grant/Award Number: 123456.}
\fi
\begin{document}
\title{Evaluating \emph{eUniRep} and other protein feature representations
for~\emph{in silico} directed evolution}
\author{Andrew Favor}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\selectlanguage{english}
\begin{abstract}
This study analyses and extends the low-N, data-efficient deep learning
approach to protein engineering introduced by Biswas et
al~\textsuperscript{\hyperref[csl:1]{1}}. We provide a complete, open-source, end-to-end
re-implementation of the~\emph{in silico}~protein engineering pipeline
with improved computational efficiency, more detailed documentation, a
cleaner API, and additional features that lower the barrier to entry for
use of this pipeline as an engineering tool. We additionally perform a
more thorough evaluation of the success and necessity of each step in
the pipeline for~\emph{in silico} directed evolution, re-implementing
select portions of the study of TEM-1 \selectlanguage{greek}β-\selectlanguage{english}lactamase and applying
the full~\emph{in silico} pipeline to two novel protein engineering
tasks: increasing the melting temperature of the plastic-degrading
enzyme IsPETase and improving the thermostability of the MS2
bacteriophage coat protein. Our findings corroborate some of the main
benefits of the eUniRep protein representation highlighted in Biswas et
al, but we also identify key limitations not previously discussed.
Finally, we provide a simple mathematical proof, supported by a case
study, that linear kernels are equivalent to additive fitness landscapes
and outperform more complex models on small or single-mutation
prediction tasks. This equivalence is assumed in many previous works but
has never been shown explicitly. We believe it further elucidates the
main strength of the eUniRep representation: its ability to overcome
epistatic effects when proposing extensively mutated candidate sequences
with optimized functionality.%
\end{abstract}%
\sloppy
\par\null
\section*{Introduction}
{\label{742736}}\par\null
The foundation of protein engineering lies in introducing new mutations
for the purpose of modifying protein function. The first attempts to
rationally design new mutants with improved function relied on
leveraging chemical and biological domain knowledge to correlate human
interpretable patterns and motifs in secondary and tertiary folded
protein structures with possible functionality improvements
~\textsuperscript{\hyperref[csl:2]{2}}. These traditional predictive pipelines require
accurate characterization or prediction of protein folded 3D structures,
which is a difficult task~\textsuperscript{\hyperref[csl:3]{3}}. Thus one of the grand
challenges of modern computational protein engineering is in skipping
structure entirely, and instead developing accurate models to predict
direct amino acid sequence-to-function relationships. Directed evolution
is a hypothesis-free, non-structure dependent approach that has been
efficiently utilized in order to optimize the functionality of
bio-molecular systems through the improvement of selected
traits~\textsuperscript{\hyperref[csl:4]{4},\hyperref[csl:5]{5},\hyperref[csl:6]{6}}. However, the original directed evolution
pipeline is both slow and resource-intensive, requiring multiple
high-throughput iterations of sequencing and functional
characterization. Moreover, the desired protein function is often
difficult to test experimentally in high throughput, so a correlated
proxy fitness function is instead used as the predictive label. The search space
(fitness landscape) of possible amino acid residue mutations is both
immense in size and sparse in protein structures that actually
fold~\textsuperscript{\hyperref[csl:7]{7},\hyperref[csl:8]{8}}, let alone that yield improved fitness. Even
when restricted to considering the variant primary sequences that can
result from just two co-expressed mutations, it is virtually impossible
to test the fitness of all possible combinations through physical
synthesis or even computational modeling~\textsuperscript{\hyperref[csl:4]{4},\hyperref[csl:9]{9},\hyperref[csl:10]{10}}.
This necessitates the development of more sophisticated predictive
models. Machine learning techniques can be used to computationally guide
the directed evolution process, reducing the number of experimental
cycles required~\textsuperscript{\hyperref[csl:11]{11}}.
A key remaining bottleneck to further improvements in speed and
experimental cost is the high number (N) of initial wild-type mutants
(the training set) needed to develop any predictive model. In a recent
manuscript, the Church Lab addressed this issue using data-efficient
deep learning to achieve low-N protein
engineering~\textsuperscript{\hyperref[csl:1]{1}}. The approach described by Biswas et al
holds promise as a powerful tool in protein engineering: given only a
small library of synthesized and characterized protein variants, and a
single directed evolution cycle, one could engineer extensively modified
proteins with optimized functionality. Several existing protein
engineering procedures and software packages use a common Markov chain
Monte Carlo (MCMC) simulation regime with a Metropolis-Hastings
acceptance criterion defined by a Boltzmann probability distribution, in
which a temperature parameter governs the rate at which new mutations
are adopted over current states throughout the mutagenesis trajectory.
The relative favorability of proposed mutations is approximated by
statistical-mechanics-based energy calculations~\textsuperscript{\hyperref[csl:12]{12},\hyperref[csl:13]{13}}.
However, these approaches fall short in their ability to accurately
predict the fitness of a given mutation, as their applied calculations
are largely restricted to a protein's native folded state, and cannot
predict whether a mutation allows the protein to assemble through a
folding pathway to reach that state in the first place. The~\emph{in
silico} directed evolution protocol in Biswas et al makes 2 major
improvements to existing schemes. The first, is in exploiting the often
ignored dark proteome by doing unsupervised pre-training on millions of
uncharacterized protein sequences to learn a semantically rich,
universal representation (UniRep) of protein
sequences~\textsuperscript{\hyperref[csl:14]{14}}. This represention is further fine-tuned on
a smaller, curated subset of evolutionarily relevant sequences to get a
final ``evotuned'' representation (eUniRep). The second key insight is
in using a non-physical Boltzmann distribution that is a function of
eUniRep-dependent fitness predictions in their MCMC~\emph{in silico}
directed evolution. In this non-physical Boltzmann distribution, the
numerator of the exponent is the difference between a proposed mutant's
predicted fitness and that of the current sequence. This scheme is
expected to confer a significant benefit, as the eUniRep representations
are trained to encode features that reflect the natural laws governing
favorable folding and function in proteins. Optimization of this pipeline will
be critical in the future of protein engineering for the development of
various technologies ranging from nanomaterials and drug
delivery~\textsuperscript{\hyperref[csl:15]{15}} to enzymatic plastic
degradation~\textsuperscript{\hyperref[csl:16]{16},\hyperref[csl:17]{17}}. In this work we provide a
complete,~open-source, end-to-end re-implementation of the~\emph{in
silico} UniRep protein engineering pipeline with improved computational
efficiency, more detailed documentation, a cleaner API, and additional
features to lower the barrier to entry for use of this pipeline as an
engineering tool.
The cornerstone of the success of the low-N pipeline is the eUniRep
representation. That better features lead to improved model performance
is well established in machine learning, and richer representations are
becoming a ubiquitously exploited way to improve model performance in
protein bioinformatics as well~\textsuperscript{\hyperref[csl:18]{18},\hyperref[csl:19]{19},\hyperref[csl:20]{20},\hyperref[csl:21]{21}}. To further explore the
success and necessity of the eUniRep feature space, we use our modified
pipeline to compare the performance of various protein feature
representations, with various supervised models, on three vastly
different proteins, optimizing for three different functions. (1)
Increasing the catalytic rate of TEM-1 \selectlanguage{greek}β-\selectlanguage{english}lactamase (PDB: 1ZG4), using
wild-type and mutant fitnesses from Firnberg et
al~\textsuperscript{\hyperref[csl:22]{22}}, as a reproducibility study to replicate
and extend the results and analysis presented in Biswas et al. (2)
Increasing the melting temperature of~\emph{Ideonella
sakaiensis} PETase (IsPETase, PDB: 6QGC), a recently discovered enzyme
that can degrade polyethylene terephthalate (PET) and other aromatic
polyesters. We chose PETase because plastic waste is an ever-growing
problem that demands an immediate solution. PETase is revolutionary in
its ability to degrade the ubiquitously used PET under mild conditions;
however, its low melting temperature and inherent instability greatly
limit its practical use on an industrial scale, hence the need to
engineer more thermally stable mutants. Moreover, because a multitude of
research groups are working on PETase engineering, we can directly compare
the eUniRep~\emph{in silico} directed evolution protein engineering
approach to other recently published computational protein engineering
pipelines, such as the GRAPE method~\textsuperscript{\hyperref[csl:23]{23}}. Cui et al
used greedy algorithms and k-means clustering to design an IsPETase
mutant with a melting temperature elevated by 31 degrees above wild
type, using a similarly low-N number of starting mutants. (3) Finally, we
chose the icosahedral coat-protein of the MS2 bacteriophage (PDB:2MS2),
due to the availability of extensive fitness data for mutated variants
of this protein, allowing for in-depth epistasis
analysis~\textsuperscript{\hyperref[csl:15]{15}}, as well as the ability of virus-like
nanoparticles to provide a stable scaffolding for the development of
nanomaterials, with a versatile range of new functionality and a
remarkable tolerance to extensive synthetic
modification~\textsuperscript{\hyperref[csl:24]{24},\hyperref[csl:25]{25},\hyperref[csl:26]{26}}.~
\par\null
\section*{Method}
{\label{874460}}
\subsubsection*{Existing Pipeline}
{\label{323002}}
We initially re-implemented the five \emph{in silico} steps of the low-N
protein engineering pipeline as outlined in Biswas et
al~\textsuperscript{\hyperref[csl:1]{1}} for a target protein. This~\emph{in silico}
method assumes prior wet-lab functional characterization of a small
number (low-N) of random mutants of the wild-type target protein:
\begin{enumerate}
\tightlist
\item
Learn a global, 1900-dimensional protein feature representation by
training an mLSTM to do next-character prediction on \textgreater{}20
million raw (unlabeled) amino acid sequences~\textsuperscript{\hyperref[csl:14]{14}}. This
is the ``UniRep'' representation.
\item
Curate a smaller (order of 10,000) dataset of characterized protein
sequences that are known to be evolutionarily related to the target
protein.
\item
Learn features local to the target protein by fine-tuning the weights
from step 1 on the dataset from step 2. This is the ``evotuned
UniRep'' or ``eUniRep'' representation.
\item
Train a ``simple'' supervised top model (ensemble ridge regression with
sparse refit) on the 1900-dimensional feature space representation of
a small number (\textless{}100) of characterized (labeled) wild-type
mutants of the target protein (obtained by passing the protein
sequences through the mLSTM trained in step 3). This top model
provides an end-to-end sequence-to-function model for proteins local
to the target protein.
\item
Markov Chain Monte Carlo (MCMC) based~\emph{in silico} directed
evolution on the target protein, outputting mutated sequences that are
predicted by the top model in step 4 to have the most improved
function relative to wild type (\textgreater{}WT). Although the
\emph{in silico} pipeline is complete, the top mutant sequences must then be
functionally characterized to complete the full directed evolution
cycle and confirm functional improvement in the engineered protein
sequences.
\end{enumerate}
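The MCMC directed evolution of step 5 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: \texttt{propose} and the \texttt{predict\_fitness} argument are hypothetical stand-ins for the real mutation-proposal scheme and the trained top model, and additional constraints (for example, on the number of mutations from wild type) are omitted.

```python
import math
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def propose(seq, n_mut=1):
    """Propose a variant by substituting randomly chosen positions."""
    s = list(seq)
    for pos in random.sample(range(len(s)), n_mut):
        s[pos] = random.choice(AAS)
    return "".join(s)

def mcmc_directed_evolution(wt, predict_fitness, n_steps=1000,
                            temperature=0.1, seed=0):
    """Metropolis-Hastings trajectory over sequence space.

    A proposal is accepted with probability
    min(1, exp((f_proposed - f_current) / T)), the non-physical
    Boltzmann criterion; `predict_fitness` stands in for the top model
    applied to eUniRep features.
    """
    random.seed(seed)
    current, f_current = wt, predict_fitness(wt)
    best, f_best = current, f_current
    for _ in range(n_steps):
        proposal = propose(current)
        f_prop = predict_fitness(proposal)
        if (f_prop >= f_current
                or random.random() < math.exp((f_prop - f_current) / temperature)):
            current, f_current = proposal, f_prop
            if f_current > f_best:
                best, f_best = current, f_current
    return best, f_best
```

With a toy fitness function such as counting a particular residue, the trajectory rapidly accumulates accepted improvements; in the real pipeline, \texttt{predict\_fitness} would embed each candidate with the evotuned mLSTM and apply the supervised top model.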
As the training in step 1 was reported to take 3.5 weeks of wall-clock
time on an AWS instance, we opted to use the weights provided by Alley
et al~\textsuperscript{\hyperref[csl:14]{14}} as our starting point. Due to the minimal
open-source code available, we largely re-implemented the remainder of
the pipeline from scratch, following provided method descriptions in
Biswas et al. We identified five key areas for improvement in our
re-implementation:
\subsubsection*{1. Curation of evolutionarily similar sequences for
evotuning}
{\label{143907}}
The EBI JackHMMer web application~\textsuperscript{\hyperref[csl:27]{27}} used to curate the
training set of sequences for evotuning is unable to reliably handle a
high volume of output hits for similar protein sequences. The
alternative of installing and running HMMer locally requires downloading
the entire ReferenceProteomes database, which was unrealistic given our
local storage constraints. A potential workaround could be to script
around a local install of HMMer to search the ReferenceProteomes
database from an online source. In our modified pipeline, we instead
draft a formalized procedure for curating the local dataset by searching
the Pfam and InterPro online databases and APIs (Supplementary 1).
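As an illustration of the kind of local post-processing such curation involves, the following sketch filters a set of downloaded hits before evotuning; the minimal FASTA parser, the length window, and the deduplication rule are illustrative choices, not the criteria of the Supplementary 1 procedure.

```python
STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")

def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records, header, chunks = {}, None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip())
    if header is not None:
        records[header] = "".join(chunks)
    return records

def curate(records, target_len, length_tol=0.5):
    """Keep unique sequences with standard residues and length near the target.

    The length window and deduplication are illustrative filters, not the
    published curation criteria.
    """
    lo, hi = target_len * (1 - length_tol), target_len * (1 + length_tol)
    seen, kept = set(), {}
    for header, seq in records.items():
        if set(seq) <= STANDARD_AAS and lo <= len(seq) <= hi and seq not in seen:
            seen.add(seq)
            kept[header] = seq
    return kept
```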
\subsubsection*{2. mLSTM training and representation fetching speed and
memory
improvements}
{\label{279270}}
The original TensorFlow implementation of the mLSTM was unable to
process sequences \textgreater{}275 amino acids in length due to memory
constraints, even with up to 16~GB of RAM. This poses a major issue for
the many proteins with sequence lengths greater than this, including
PETase, whose sequence length is approximately 300. We collaborated to complete
an open-source JAX re-implementation of the mLSTM to get UniRep
representations of sequences, as well as for evotuning to get eUniRep
sequence representations. JAX was chosen because it achieves a roughly
100x speedup in passing new protein sequences through the trained mLSTM
to obtain the 1900-dimensional UniRep / eUniRep representations. Technical
implementation details and performance comparisons can be read in Ma et
al~\textsuperscript{\hyperref[csl:28]{28}}. The JAX implementation crucially had the added
benefit of requiring less memory to run, affording the ability to
evotune on longer protein sequences.
Evolutionary fine-tuning of the UniRep representations for PETase and MS2 was performed on an Amazon Web Services EC2 instance (g4dn.2xlarge:8 vCPUs, 32 GiB RAM, and one NVIDIA T4 Tensor Core GPU). Each evotuning job was run for 25 epochs. Evotuning was performed with learning rates of 10\textsuperscript{-5} and 10\textsuperscript{-6}. Additionally, a local-evotuning procedure was performed for MS2, wherein training weights were initialized with random values, rather than the pre-trained Global UniRep weights, using a learning rate of 10\textsuperscript{-5} (Fig 1a). eUniRep weights for TEM-1 were provided by Biswas et al.
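How a recurrent model turns variable-length sequences into fixed-length feature vectors can be illustrated with a toy numpy sketch of an mLSTM whose average hidden state serves as the representation. The dimensions and random weights here are arbitrary toy values, not the trained parameters; the actual 1900-unit JAX implementation is described in Ma et al.

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"
EMB, HID = 10, 16  # toy sizes; the real mLSTM uses a 1900-unit hidden state

rng = np.random.default_rng(0)
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in {
    "embed": (len(AAS), EMB),                    # amino acid embeddings
    "wmx": (EMB, HID), "wmh": (HID, HID),        # multiplicative branch
    "wx": (EMB, 4 * HID), "wm": (HID, 4 * HID),  # i, f, o, g gates
}.items()}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_avg_rep(seq, p=params):
    """Average hidden state of a toy mLSTM: a fixed-length representation
    regardless of sequence length."""
    h, c, hs = np.zeros(HID), np.zeros(HID), []
    for aa in seq:
        x = p["embed"][AAS.index(aa)]
        m = (x @ p["wmx"]) * (h @ p["wmh"])      # multiplicative interaction
        i, f, o, g = np.split(x @ p["wx"] + m @ p["wm"], 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        hs.append(h)
    return np.mean(hs, axis=0)
```

Sequences of any length map to the same feature dimension, which is what allows a single top model to be trained across variant libraries.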
\subsubsection*{3. Rigorous top model selection and fine
tuning}
{\label{543918}}
Biswas et al report performance comparisons for only four top model
types: Lasso-Lars, Ridge, Ridge with sparse refit, and ensemble ridge
with sparse refit, all with minimal information on what hyperparameters
were used and how well the top models fit. Moreover, top model
performance comparisons are only given in the form of retrospective
experiments for avGFP (green fluorescent protein
from~\emph{Aequorea victoria}), based on which it is unconvincing both
(A) that pre-training on 20 million sequences is required, as local
eUniRep appears to show comparable performance, and (B) that ensembled ridge regression with
sparse refit is required, as it only demonstrates minor improvement over
basic ridge regression in fold improvement over random sampling, while
requiring a greater training time and a larger number of hyperparameters
to tune. Motivated by the lack of reported results comparing and
evaluating various top model performances, we implemented a thorough,
general top model evaluation script that can be run on each new target
protein to determine the best top model and hyperparameters for a given
dataset.
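For concreteness, the flavor of an ensembled ridge top model with sparse refit can be sketched as follows. This is a schematic reading of the method, with closed-form ridge, magnitude-based feature selection, and bootstrap ensembling as illustrative choices; the hyperparameters (\texttt{lam}, \texttt{top\_k}, \texttt{n\_models}) are placeholders, not values used by Biswas et al.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression; the bias is handled by centering."""
    Xm, ym = X.mean(axis=0), y.mean()
    Xc, yc = X - Xm, y - ym
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return w, ym - Xm @ w

def sparse_refit(X, y, lam=1.0, top_k=10):
    """Fit ridge, keep only the largest-magnitude coefficients, refit."""
    w, _ = ridge_fit(X, y, lam)
    keep = np.argsort(np.abs(w))[-top_k:]
    w_sub, b = ridge_fit(X[:, keep], y, lam)
    w_full = np.zeros(X.shape[1])
    w_full[keep] = w_sub
    return w_full, b

def ensemble_predict(X_train, y_train, X_test, n_models=10, seed=0):
    """Average the predictions of sparse-refit models trained on bootstraps."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y_train), size=len(y_train))
        w, b = sparse_refit(X_train[idx], y_train[idx])
        preds.append(X_test @ w + b)
    return np.mean(preds, axis=0)
```

Our evaluation script treats each such design decision (regularization strength, sparsity level, ensemble size) as a hyperparameter to be searched per target protein.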
\subsubsection*{4. New scoring function to evaluate top model
performance}
To quantify the performance of a predictive model within this specific
context of application, it is important to acknowledge the key factors
in its ultimate use case: the Metropolis-Hastings acceptance criterion.
In the MCMC simulations that emulate directed evolution, the probability
that a proposed mutant sequence is accepted as the new sequence at each
iteration depends on whether the proposed sequence has a greater fitness
score than the current sequence. Thus, an appropriate top model
would not necessarily prioritize predicting fitness scores that optimize
the average closeness to the experimental fitness scores, but rather
prioritize accurately predicting the relative magnitude of each fitness
score with respect to the values of other mutant sequences being
considered. In Biswas et al, the only purely \emph{in silico} top model
characterization is done with the following scoring metric: use a given
top model to do a ranked sorting of a holdout set of mutant sequences.
The score is the count of mutants in the sorted top 10\% that have
fitness greater than wild type, expressed as a ratio over the same count
averaged across random 10\% samplings. The shortcoming of this scoring scheme is its lack of
generalizability. The choice to look at the top 10\% is an arbitrary
one, and it requires as input the wild type fitness value, specific to
each protein and function. Furthermore, scores will not be comparable
across different datasets, even within the same protein and function, as
the scores are highly dataset distribution dependent. To solve these
issues and optimize parameters for the task of fitness score ranking, we
developed a new scoring function (Supplementary 2).
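For reference, the baseline Biswas-style metric described above can be written down directly (our replacement scoring function is defined in Supplementary 2 and not reproduced here); the number of random samplings is an illustrative choice.

```python
import numpy as np

def top_fraction_enrichment(pred, actual, wt_fitness, frac=0.1,
                            n_random=1000, seed=0):
    """Count of better-than-wild-type mutants in the model's top `frac`,
    as a ratio over the same count averaged across random samplings."""
    n = max(1, int(frac * len(pred)))
    top = np.argsort(pred)[-n:]              # model-ranked top fraction
    hits = np.sum(actual[top] > wt_fitness)
    rng = np.random.default_rng(seed)
    rand_hits = np.mean([
        np.sum(actual[rng.choice(len(actual), n, replace=False)] > wt_fitness)
        for _ in range(n_random)
    ])
    return hits / max(rand_hits, 1e-9)
```

A perfect ranking of 100 mutants of which ten exceed wild type scores near 10, while a random ranking scores near 1; note the score's explicit dependence on the arbitrary \texttt{frac} and on \texttt{wt\_fitness}, the shortcomings discussed above.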
\subsubsection*{5. Quality of life
improvements}
{\label{887484}}
Finally, and perhaps most importantly, to improve end-user usability and
lower the barrier of entry for non-machine learning experts to utilize
this pipeline in their research, we have made our entire pipeline well
documented, open source and easily customizable.
\subsubsection*{Feature Analysis}
{\label{732257}}
With the re-implemented pipeline, we replicated the analysis done in
Biswas et al, comparing various protein feature representations with
some additional comparisons: UniRep, several different evotuned weights
(eUniReps), locally evotuned weights, one-hot encodings of the full
amino acid sequence, and reduced-dimension representations obtained by
PCA of all of the above. This served to evaluate whether eUniRep is
necessary and effective across the different new proteins and functions
we seek to optimize.
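The PCA reduction applied to each representation matrix can be sketched in a few lines of numpy; the number of retained components is a free parameter of the analysis.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project feature vectors (rows of X) onto their top principal
    components, computed via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T
```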
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/big-pred-performance-fig-with-epistasis/big-pred-performance-fig-with-epistasis-NEW-ERROR-FUNC}
\caption{{\textbf{(a).}~ Loss-trajectories over the course of evotuning for MS2
and PETase.~~\textbf{(b).}~ Improvement over random sampling for ability
to predict the relative fitness rankings of test-set mutants; datasets
include MS2 single-mutants, MS2 double-mutants, TEM-1~\emph{\selectlanguage{greek}β}-\selectlanguage{english}lactamase
single-mutants, and PETase single-mutants.~~\textbf{(c).}~ Comparison of
top-model predictions of MS2 double-mutant fitness when trained on MS2
single-mutant dataset.~ Top models are trained using global UniRep,
eUniRep, and one-hot sequence representations.~ Predicted fitness values
are compared to experimentally measured fitness, and additive
fitness-change predictions based on single-mutant differences from
wild-type fitness.
{\label{809701}}%
}}
\end{center}
\end{figure}
\par\null
\section*{Results \& Discussion}
{\label{846141}}
\subsubsection*{The success and necessity of feature representations
varies across 3 different
proteins}
{\label{268707}}
Our first objective was to confirm that our pipeline is able to
reproduce the results presented for TEM-1
\selectlanguage{greek}β-\selectlanguage{english}lactamase in Biswas et al. Using our fitness-ranking error
function, we calculated the percent reduction in ranking-error for four
different mutant datasets, using several different embedded sequence
representations, with performance compared across five different
training-batch sizes~\textbf{(Fig 1b)}.~ For the TEM-1 \selectlanguage{greek}β-\selectlanguage{english}lactamase
dataset, percent error reduction was calculated for training-batch sizes
of N = 24, 48, 72, 96, and 120; the best reduction in error, averaged
across all batch sizes, was achieved using an evotuned representation~
(average error reduction =~ 26.05\%), followed closely by the Global
UniRep representation (20.51\%).~ These results were compared against a
one-hot encoding, which performed worse than the UniRep inputs, with a
6.50\% average reduction in ranking-error.
In evaluating performance using our single-mutant MS2 dataset, we see
that the evotuned representation and Global UniRep only produce minor
improvements in ranking error (6.62\% and 9.19\% error reduction for
Global UniRep and evotuned UniRep, respectively), while the one-hot
encodings performed only slightly worse than Global UniRep, with a
6.35\% error reduction over random sampling.~ Additionally, we tested
performance of a locally-evotuned representation, which was trained for
25 epochs after being initialized with random parameters, rather than
the Global UniRep weights.~ The locally-evotuned representation gave the
worst performance of all representations tested, with a
3.48\%~\emph{increase} in ranking-error over random sampling.~ ~We
repeated this analysis using a second data set of MS2 mutants,
containing fitness scores for all possible combinations of double amino
acid substitutions within the region of residues 71 to 76. When using
these data, we found a significant improvement in our top-model's
ability to predict the relative fitness rankings of test-batch mutants,
with error reductions of 33.09, 39.80, 4.39, and 37.65\% for Global
UniRep, evotuned UniRep, locally-evotuned UniRep, and one-hot encodings,
respectively, averaged over training batch sizes of N = 24, 48, 72, 96,
and 120.~ For both single- and double-mutant data sets, the worst
predictive performance was seen when using the locally-evotuned UniRep; we suspect that
this is due to the low number of evotuning epochs used to produce this
representation's parameters, and that this short training period is not
sufficient for getting from a random initial state to the deep-learned
sequence embeddings possessed by the globally trained UniRep.
The same analysis described above was performed on our 85-sample PETase
data set, but with smaller training and test batch sizes, due to having
less data available. In addition to comparing performance using UniRep,
PETase-eUniRep, and one-hot encodings, we tested predictive performance
using weights that had been evotuned for TEM-1 Beta-lactamase, which
were provided by Biswas et al.~ Averaged over training-batch sizes of N
= 24, 36, 48, 60, and 72, our predictive models produced ranking-error
reductions of 2.68, 1.46, 6.99, and 1.20\% for the Global UniRep,
PETase-eUniRep, TEM1-eUniRep, and one-hot sequence representations,
respectively. We chose to compare the performance of representations
evotuned for PETase and TEM-1 in order to provide a way to evaluate the
efficacy of our own evotuning procedure relative to that performed by
Biswas et al.~ While Beta-lactamases (TEM-1) and Lipases (PETase) come
from fairly different families of enzymes, we expected that their
evotuned representations would possess some similarities.~ Our
experiments show the TEM-1 weights yielding better results than those
specifically evotuned for PETase, suggesting that modifications to our
own evotuning procedure may be necessary in order to access the full
potential of evotuned representations.
The full set of error-reduction results for various combinations of
batch sizes and sequence representations for each protein dataset can be
seen in supplementary figure (XX).
However, it is important to remember that these results do not tell the
full story, as single mutants do not properly represent the evolutionary
fitness landscape. Multiple consecutive mutations need to be considered
to capture epistatic effects, which may not be captured by simpler top
models.
\subsubsection*{Epistasis detection: prediction of consecutive mutations
fitness from single mutant
data}
{\label{892496}}
An effective predictive model for protein engineering purposes needs to
be able to predict the epistatic effects of multiple co-expressed
mutations with better accuracy than the predictions provided by the
additive changes from multiple single-mutant fitness scores. We define
epistasis herein as the difference between the effect that multiple
mutations have when expressed together and the additive sum of the
effects of the individual mutations~\textsuperscript{\hyperref[csl:29]{29},\hyperref[csl:30]{30}}. Due to the cooperative
nature of interactions between neighboring amino acids in a protein, it
is challenging to predict the effect that a single amino acid mutation
has on the protein's overall ability to fold and retain a stable and
functional state. The introduction of multiple mutations increases the
complexity of this problem exponentially, and can effectively result in
novel interactions between amino acids, not observed in the wild type.
To characterize the ability of the different embedded sequence
representations to describe the effects of consecutive mutations, we
trained our top model on input data from the MS2 single-mutant fitness
landscape, and compared its predictions of double mutant fitness to
experimentally determined values for the fitness of double-mutants
(\textbf{Fig 1c}).~When trained on the full single-mutant fitness
landscape data, and tested on the double mutant dataset, the best
predictive accuracy was achieved by use of the one-hot encoding (MSE =
0.743), followed closely by eUniRep and Global UniRep (MSE = 0.987 and
1.07, respectively). As an additional comparative metric to the one-hot
representations, we calculated the additive predictions of fitness for
double mutants, using the single-mutant fitness data and Equation 2,
which had a mean square error of 0.841, performing just slightly worse
than the one-hot representations.
In the case of double-mutants, our analysis employs the following mathematical definition of the epistasis associated with a double-mutation, imposed on residues $i$ and $j$:
\begin{equation}
\epsilon = \Delta f_{ij} - ( \Delta f_{i} + \Delta f_{j} )
\end{equation}
where $\Delta f_{x} = f_{x} - f_{WT}$. In fitness-landscape regions where epistatic effects are minimal, the compounding effects of two mutations are expected to be ``additive'', such that the double-mutant fitness of mutations $i$ and $j$ can be predicted by:
\begin{equation}
f_{ij} = f_{WT} + \Delta f_{i} + \Delta f_{j}
\end{equation}
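Equations 1 and 2 translate directly into code; the fitness values in the comments below are made-up numbers for illustration only.

```python
def epistasis(f_wt, f_i, f_j, f_ij):
    """Equation 1: deviation of the double-mutant fitness change from
    the sum of the single-mutant fitness changes."""
    return (f_ij - f_wt) - ((f_i - f_wt) + (f_j - f_wt))

def additive_prediction(f_wt, f_i, f_j):
    """Equation 2: the no-epistasis prediction of double-mutant fitness."""
    return f_wt + (f_i - f_wt) + (f_j - f_wt)

# Toy values: f_wt=1.0, f_i=1.2, f_j=0.9, measured f_ij=1.05.
# The additive prediction is 1.1, so the epistasis is -0.05
# (the double mutant underperforms the additive expectation).
```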
In our characterization of how UniRep contributes to the prediction of
double mutations when models are trained solely on single mutations, we
find that UniRep-based models are often outperformed by one-hot
representations. At first
glance, this would suggest that learned sequence representations, like
UniRep, offer negligible benefits over simple sequence representations
like one-hot encodings; however, when applying predictive models to
contexts where a large number of consecutive mutants are introduced, the
benefit of deep-learned representations is likely to become more
pronounced.~ As has been described in comparative analyses of the role
that sequence representations play in protein fitness prediction models,
linear models offer adequate performance on fitness landscapes where
epistatic effects are limited; the effects of consecutive mutations
trend with the additive changes in fitness predicted by the summed
effects of their constituent single amino acid
mutations~\textsuperscript{\hyperref[csl:11]{11}}.~ This trend is clearly illustrated in
{[}supplementary figure X{]}, where the one-hot trained model
predictions are nearly identical to predicted additive changes in
fitness (calculated using~\textbf{Equation 2}). ~Furthermore, the large
degree of similarity between one-hot predictions and additive
predictions suggests that the double-mutant data are sampled from a
region of considerably low epistatic effects, causing the double-mutant
fitness values to trend with the sum of their single-mutant
constituents.~ Following this observation, we decided to derive the
conditions under which one-hot representations of protein sequences
would lead to predictions that resemble additive fitness changes for
consecutive mutations, when training a linear model on single-mutant
data (\textbf{Supplementary Discussion 2}).~ It is expected that one-hot
encodings offer adequate performance in regions of minimal epistatic
effects, and when a predictive model is only considering simple mutation
cases, with low diversity of interaction types, such as the combined
effect of two co-expressed mutations.~ However, the margin of error for
any model grows exponentially with each consecutive round of
mutagenesis, as each additional mutation brings with it a range of new
potential interactions between neighboring residues; this is likely
where deep-learned representations such as UniRep show their true value,
having been trained to identify patterns of favorable or unfavorable
local amino acid arrangements.~ On top of UniRep's ability to embed
general information regarding amino acid arrangement patterns, it is
likely that an evotuned representation would provide further value by
embedding taxonomically-specific information, such as where specific
residue arrangements should be located within a full amino acid
sequence, for a given family of protein, such that the desired tertiary
structures will assemble.
\par\null
\subsubsection*{Top model selection and
tuning}
{\label{197915}}
In our analysis so far we have assumed the use of an optimized ridge
regression top model. To find a model that could accurately predict the
effect of a new mutation on a given sequence, we tested the predictive
accuracy of several linear regressions over a range of hyperparameters.
While various regression models were tested, including Lasso, Huber,
RANSAC, and ensemble methods, we found minimal performance improvement
from increased model complexity over simple linear ridge regression.
Ridge regression has a single regularization hyperparameter that can be
optimized by cross-validation. In Biswas et al., a sparse refit is used,
with the effect of increasing regularization strength, to prevent
overfitting to the starting mutant dataset, which is likely not
representative of the actual fitness landscape. In our pipeline, the
scoring function for cross-validation is set to our custom ranking-error
function (Supplementary S2). For each protein, we evaluated performance
at training set sizes ranging from N = 8 to N = 1600.
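A minimal sketch of this tuning step, assuming scikit-learn's \texttt{RidgeCV} and synthetic stand-in data (the array shapes and alpha grid below are illustrative, not the pipeline's actual representations or custom scorer):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(96, 256))  # stand-in for N = 96 UniRep-style representations
y = rng.normal(size=96)         # synthetic fitness labels

# Grid of regularization strengths; RidgeCV selects one by cross-validation.
model = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1.0, 10.0], cv=5)
model.fit(X, y)
# model.alpha_ holds the selected regularization strength.
```

A custom scorer, such as the ranking-error function of Supplementary S2, could be plugged in via `RidgeCV`'s `scoring` parameter.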
\subsubsection*{\texorpdfstring{MCMC \emph{in silico} directed
evolution}{MCMC in silico directed evolution}}
{\label{708604}}
Top-model training batch sizes and hyperparameters were selected
separately for each protein before the simulated directed evolution
runs. Hyperparameters and batch sizes were chosen based on
validation-set fitness predictions (\textbf{Fig 2a}), using the
ranking-error function described in Supplementary S2; the selected alpha
values were 1e-2 for MS2 and PETase and 1e-3 for Beta-lactamase, and the
batch sizes were 240, 280, and 71 for Beta-lactamase, MS2, and PETase,
respectively. Our evolution trajectories were run for 25 iterations of
random mutagenesis coupled with selection steps based on the predicted
fitness of proposed mutations (\textbf{Fig 2b}). While we were able to
effectively incorporate this component into our pipeline, and can
therefore produce a large number of candidate mutants for the evaluated
proteins, experimental validation is still required to determine whether
the proposed variants do in fact possess the desired trait enhancements.\selectlanguage{english}
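The mutate-and-select loop above can be sketched as follows. This is an illustrative simplification: `predict_fitness` is a hypothetical stand-in for the trained ridge top model, and the sequence, step count, and temperature are arbitrary.

```python
import math
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"  # canonical amino acids

def predict_fitness(seq):
    # Hypothetical stand-in for the trained top model: rewards alanine content,
    # purely so the demo has something to optimize.
    return seq.count("A") / len(seq)

def evolve(seq, steps=25, temperature=0.01, seed=0):
    rng = random.Random(seed)
    fit = predict_fitness(seq)
    best_seq, best_fit = seq, fit
    for _ in range(steps):
        pos = rng.randrange(len(seq))                        # random mutagenesis
        cand = seq[:pos] + rng.choice(AAS) + seq[pos + 1:]
        cand_fit = predict_fitness(cand)
        # Metropolis criterion: always keep improvements; occasionally accept
        # worse moves to escape local optima.
        if cand_fit >= fit or rng.random() < math.exp((cand_fit - fit) / temperature):
            seq, fit = cand, cand_fit
            if fit > best_fit:
                best_seq, best_fit = seq, fit
    return best_seq, best_fit

best, best_fit = evolve("MSKGEELFTGVVPILVELDGDVNG", steps=25)
```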
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.98\columnwidth]{figures/DE-results-fig-combined/DE-results-fig-combined}
\caption{{\textbf{(a).} Predictive performance of linear top-models used in
directed evolution simulations.~ Ridge-regressions were trained and
tested on single-mutant datasets of~\selectlanguage{greek}β-\selectlanguage{english}lactamase, MS2, and PETase.~
\textbf{(b).}~ Five replicate trajectories of simulated~\emph{in silico}
directed evolution for variants of~\selectlanguage{greek}β-\selectlanguage{english}lactamase, MS2, and PETase, each
with 25 steps of potential mutation and selection.
{\label{229469}}%
}}
\end{center}
\end{figure}
\par\null
\subsection*{Comparing performance across proteins}\label{point-6}
A comparison of the three proteins highlights how important the span of
the input data is: without a sufficiently wide range of training fitness
values, predictive performance degrades sharply. For TEM-1, for example,
fitting remained poor until training samples were drawn randomly from
the entire dataset rather than from only a portion of it. The proteins
also differ in sequence length, with the MS2 coat-protein sequences
substantially shorter than those of PETase and TEM-1, and in the
inherent function being predicted: catalytic activity is specific to the
TEM-1 dataset, whereas the stability objective for MS2 is shared with
the PETase T\textsubscript{m} objective.
While one of our main goals in this project was to use the mutant data
from Cui et al.\ to identify PETase variants that exhibit improved
T\textsubscript{m}, all of our predictive models trained on this data
set showed poor performance. We suspect that this poor performance can
be attributed to two primary factors: first, the data set was
considerably small, with only 85 mutant samples; second, many of the
data represent mutations occurring towards the middle of the amino acid
sequence, and likely do not adequately span the configuration space of
possible mutations that we would be predicting for.
It is also possible that our evotuning is somewhat off: eUniRep performs
worse than UniRep on the PETase and MS2 datasets, and the weights
evotuned for TEM-1 actually outperform those evotuned for PETase on the
PETase dataset. It is hard to say definitively what this tells us.
\par\null
\section*{Future Work}
{\label{155355}}
There are several avenues for further exploration and pipeline
improvement. In this report we primarily explore the top model and
directed evolution, due to the computational and time constraints
involved in training the mLSTM that generates the UniRep and evotuned
weights; as a result, there is substantial further work that can be done
on the pre-training model. One curious finding from our few evotuning
runs (\textbf{Fig 1a}) is that evotuning loss does not correlate with
top-model performance, per our custom scoring function. To further probe
this relationship and determine the optimal stopping point for mLSTM
training, directed evolution performance should be validated for various
combinations of learning rates and numbers of epochs. Moreover, given
the uncertainty in the source of the superior performance of the TEM-1
evotuned weights over the IsPETase evotuned weights on the IsPETase
dataset, it could be helpful to retrain the UniRep weights from scratch
and compare them to the Church Lab's provided UniRep weights, to
eliminate a faulty JAX re-implementation as a potential error source.
One of the key differences between the JAX and TensorFlow (TF)
implementations of the mLSTM is that the TF model pads sequences of
different lengths and trains in fixed batch sizes, whereas the JAX
implementation does not pad sequences, instead training in
variable-size batches of fixed sequence length (to achieve a
computational speedup). In practice, fixed batch-size training often
yields better convergence; by modifying our implementation to train on
fixed batch sizes (while still maintaining the computational speedup),
top-model performance might be further improved. Finally, as is
happening across bioinformatics and natural language processing, mLSTMs
are being replaced by transformers\textsuperscript{\hyperref[csl:31]{31}};
this is something that could be tried here.
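The fixed-batch padding scheme described above can be sketched as follows (an illustrative toy, not taken from either the TF or JAX codebase): sequences are grouped into fixed-size batches and padded with a placeholder token to the longest sequence in each batch.

```python
PAD = "-"  # hypothetical padding token

def fixed_batches(seqs, batch_size):
    """Group sequences into fixed-size batches, padding within each batch."""
    batches = []
    for start in range(0, len(seqs), batch_size):
        batch = seqs[start:start + batch_size]
        width = max(len(s) for s in batch)  # pad to longest in this batch
        batches.append([s + PAD * (width - len(s)) for s in batch])
    return batches

batches = fixed_batches(["MKV", "MKVLA", "MK", "MKVL"], batch_size=2)
# Every batch holds at most 2 sequences, all of equal length within a batch.
```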
Ultimately, however, the most important future work is experimental
characterization of our directed evolution outputs. We believe the
biggest contribution of this work to be our complete, open-source,
generalized, end-to-end re-implemented pipeline, and we hope researchers
and engineers will find it helpful in their own work. Perhaps someone
reading this will have the lab setup to help with the experimental
characterization, either on our three protein case studies or on a novel
one of their own!
\par\null
\section*{Software Repository}
{\label{918601}}
Full pipeline
re-implementation:~\url{https://github.com/ivanjayapurna/low-n-protein-engineering}
JAX-UniRep Implementation: \url{https://github.com/ElArkk/jax-unirep}
\par\null
\section*{Acknowledgements}
{\label{687807}}
We would like to thank the Church Lab authors of both the UniRep and
Low-N protein engineering papers for their fantastic work, which
inspired us to take on this project and develop our interest in
computational protein engineering, as well as for sharing their trained
weights with us; Eric Ma and Arkadij Kummer for the brilliant JAX-UniRep
reimplementation and for teaching us a lot about best practices in
working on open-source collaborative projects and designing clean APIs;
and Prof. Jennifer Listgarten for introducing us to amazing research
papers and inspiring us with the possibilities and potential of machine
learning in biology and chemistry.
\section*{Supplementary Material}
\beginsupplement
\subsection*{S1: Evotuning pre-training dataset
curation}
{\label{915462}}
TODO.
\par\null
\subsection*{S2: Development and application of the ranking-error
function}
{\label{915462}}
To optimize model parameters for the task of fitness-score ranking, we developed a new type of error function, $E_{r}$, defined as follows:
For a validation batch size of $n$, ranking-matrices $R \in \mathbb{R}^{n \times n}$ were constructed. For any two mutant variants in a given batch, $i,j \in B(n)$, the ranking-matrix elements were assigned binary values to represent whether the variant sequence of row $i$'s fitness score was greater than, less than, or equal to the fitness score of the variant in column $j$.
\begin{equation*}
a_{i,j} =
\begin{cases}
1 & f_{i} > f_{j}\\
0 & f_{i} \leq f_{j}
\end{cases}
\end{equation*}
The purpose of allowing less-than and equal values to both be represented by $a_{i,j} = 0$ is two-fold: first, it was necessary, as some of the experimentally reported fitness scores had identical values for different mutants; second, in the Metropolis-Hastings selection steps, there is no need to accept a new mutation if it doesn't yield improvement over the current sequence. Thus, the ranking-matrix elements assigned values of $a_{i,j} = 1$ provide a simple way to represent the fitness rankings of variants, by showing which variants of columns $j$ each variant of row $i$ has fitness improvements over.
Two of these ranking-matrices were constructed: one for the experimentally-determined fitness scores of a given batch, $\Phi_{E}$, and another for the model's predicted fitness scores for the input-representation of variants in that batch, $\Phi_{P}$.
A confusion matrix, $\Psi$, was produced from an inverse-truth comparison of the two ranking-matrices, such that it had values of 1 where the predicted fitness rankings were false with respect to experimental data, and 0 where the experimentally determined rankings were matched by predictions:
\begin{equation*}
\Psi = \Phi_{P} \oplus \Phi_{E}
\end{equation*}
The sum of the elements in the confusion matrix was computed and normalized by the total number of elements, yielding our ranking-error function:
\begin{equation*}
E_{r} = \frac{1}{n^{2}}\sum_{i,j} \psi_{i,j}
\end{equation*}
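A NumPy sketch of $E_{r}$ as defined above: build the binary ranking matrices for the experimental and predicted scores, XOR them to obtain the confusion matrix, and normalize the disagreement count by $n^{2}$.

```python
import numpy as np

def ranking_error(y_true, y_pred):
    """Fraction of pairwise fitness rankings the predictions get wrong."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    phi_e = y_true[:, None] > y_true[None, :]  # experimental ranking matrix
    phi_p = y_pred[:, None] > y_pred[None, :]  # predicted ranking matrix
    psi = np.logical_xor(phi_p, phi_e)         # confusion matrix
    return psi.mean()                          # E_r = sum(psi) / n^2

# Perfectly ranked predictions give E_r = 0 even when the scales differ:
ranking_error([0.1, 0.5, 0.9], [1.0, 2.0, 3.0])  # 0.0
```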
\subsubsection*{SD2: Derivation of one-hot additivity
criteria}
{\label{674524}}\par\null
Given the fitness score for a single amino acid substitution, $y_{i,u}$, where $i$ denotes the backbone position of the mutation, and $u$ denotes the new amino acid present, we can define the mutant fitness in terms of its difference relative to the wild type fitness, $y_{wt}$:
\begin{equation*}
y_{(i,u)} = y_{wt} + \Delta y_{wt \to i,u }
\end{equation*}
where $\Delta y_{wt \to i,u } = y_{(i,u)} - y_{wt}$ .
Likewise, the predicted additive fitness of any double-mutant, characterized by the substitution of amino acids $u$ and $v$ at positions $i$ and $j$ respectively, can be defined by:
\begin{equation*}
\begin{aligned}
\widehat{y}_{(i,u),(j,v)} &= y_{wt} + \Delta y_{wt \to i,u } + \Delta y_{wt \to j,v }\\
&= y_{(i,u)} + y_{(j,v)} - y_{wt}
\end{aligned}
\end{equation*}
A one-hot encoding of a protein sequence composed of $n$ amino acids represents it as the concatenation of $n$ consecutive sparse binary row-vectors $\boldsymbol{\phi}_{u}$, where $u$ assigns the index of the vector element equal to 1, based on a defined ordering of the canonical amino acids:
\begin{equation*}
\begin{aligned}
\boldsymbol{s}_{wt} = \begin{bmatrix} \boldsymbol{\phi}_{1} & \cdots & \boldsymbol{\phi}_{i} & \cdots & \boldsymbol{\phi}_{n} \end{bmatrix}
\end{aligned}
\end{equation*}
Consider a wild type amino acid, $w$, at position $i$,
\begin{equation*}
\begin{aligned}
\boldsymbol{\phi}_{i}(w) = \begin{pmatrix} \cdots & 0 & 1 & \cdots & 0 & 0 & \cdots \end{pmatrix}
\end{aligned}
\end{equation*}
and a mutant amino acid, $u$, to substitute at position $i$:
\begin{equation*}
\begin{aligned}
\boldsymbol{\phi}_{i}(u) = \begin{pmatrix} \cdots & 0 & 0 & \cdots & 1 & 0 & \cdots \end{pmatrix}
\end{aligned}
\end{equation*}
We can then define an additive operator to represent the change in the one-hot representation when going from the wild-type sequence to the mutant sequence with amino acid $u$ substituted at position $i$:
\begin{equation*}
\begin{aligned}
\Delta \boldsymbol{s}_{wt \to (i,u) } &= \boldsymbol{s}_{(i,u)} - \boldsymbol{s}_{wt}\\
&= \begin{bmatrix} (\boldsymbol{\phi}_{1}(w) - \boldsymbol{\phi}_{1}(w)) & \cdots & (\boldsymbol{\phi}_{i}(u) - \boldsymbol{\phi}_{i}(w)) & \cdots & (\boldsymbol{\phi}_{n}(w) - \boldsymbol{\phi}_{n}(w)) \end{bmatrix}\\
&= \begin{bmatrix} \; \; \boldsymbol{\mathbf{0}} \; \; \cdots & ( \cdots & 0 & -1 & \cdots & 1 & 0 & \cdots ) & \cdots\; \; \boldsymbol{\mathbf{0}} \; \; \end{bmatrix}\\
\end{aligned}
\end{equation*}
We can extend this operator to generate one-hot sequence encodings for multiple amino acid substitutions.
Again considering a double-mutant with amino acids $u$ and $v$ substituted at positions $i$ and $j$, a one-hot sequence representation can be produced as follows:
\begin{equation*}
\begin{aligned}
\boldsymbol{s}_{(i,u),(j,v)} &= \boldsymbol{s}_{wt} + \Delta \boldsymbol{s}_{wt \to (i,u) } + \Delta \boldsymbol{s}_{wt \to (j,v) }\\
&= \boldsymbol{s}_{(i,u)} + \boldsymbol{s}_{(j,v)} - \boldsymbol{s}_{wt}
\end{aligned}
\end{equation*}
Given any linear model $F(\boldsymbol{s}_{(i,u)} ; \boldsymbol{c}, b)$, which maps a one-hot encoding $\boldsymbol{s}_{(i,u)}$ to the corresponding fitness value $y_{(i,u)}$, a fitting operation is performed to optimize the values of the weights $\boldsymbol{c}$ and bias $b$, such that:
\begin{equation*}
\begin{aligned}
F(\boldsymbol{s}_{(i,u)}) &= \begin{bmatrix} \boldsymbol{s}_{(i,u)} & 1 \end{bmatrix} \begin{bmatrix}\boldsymbol{c} \\ b \end{bmatrix}\\
&= y_{(i,u)} + \epsilon_{(i,u)}\\
\end{aligned}
\end{equation*}
where $\epsilon_{(i,u)}$ is the prediction error for a given mutation sample, which is to be minimized by fitting the model's parameters. In the fitting operation, the training elements include a feature-array, $\boldsymbol{S}$, and a fitness-vector, $\boldsymbol{y}$, corresponding to an experimentally derived data set of single amino acid substitutions.
The predictive model here is a linear operator, and thus abides by the property of being closed under additivity:
\begin{equation*}
\begin{aligned}
F\left( \sum_{(I,U)}\boldsymbol{s}_{(i,u)} \right) &= \sum_{(I,U)}F(\boldsymbol{s}_{(i,u)})\\
&= \sum_{(I,U)}\begin{bmatrix} \boldsymbol{s}_{(i,u)} & 1 \end{bmatrix} \begin{bmatrix}\boldsymbol{c} \\ b \end{bmatrix}\\
&= \sum_{(I,U)} \left( y_{(i,u)} + \epsilon_{(i,u)} \right)
\end{aligned}
\end{equation*}
Providing the linear predictive model with the one-hot encoding of a double-mutation, for any two amino acids $u$ and $v$ substituted at positions $i$ and $j$, produces additive fitness scores, offset linearly by the error values of the corresponding single-mutations:
\begin{equation*}
\begin{aligned}
F(\boldsymbol{s}_{(i,u),(j,v)}) &= F \left( \boldsymbol{s}_{(i,u)} + \boldsymbol{s}_{(j,v)} - \boldsymbol{s}_{wt} \right) \\
&= F(\boldsymbol{s}_{(i,u)}) + F(\boldsymbol{s}_{(j,v)}) - F(\boldsymbol{s}_{wt})\\
&= \left( y_{(i,u)} + \epsilon_{(i,u)} \right) + \left( y_{(j,v)} + \epsilon_{(j,v)} \right) - \left( y_{wt} + \epsilon_{wt} \right)\\
&= \left( y_{wt} + \Delta y_{wt \to i,u } + \Delta y_{wt \to j,v } \right) + \left( \epsilon_{(i,u)} + \epsilon_{(j,v)} -\epsilon_{wt}\right)\\
&= \widehat{y}_{(i,u),(j,v)} + \left( \epsilon_{(i,u)} + \epsilon_{(j,v)} -\epsilon_{wt}\right)\\
\end{aligned}
\end{equation*}
If $\boldsymbol{s}_{(i,u)}$, $\boldsymbol{s}_{(j,v)}$, and $\boldsymbol{s}_{wt}$ are all in the training set used for fitting the parameters of the predictive model, and if the fitting occurs within an under-determined system, where the number of samples is less than the number of elements in each one-hot representation ($n \times 20$), then the regression will fit closely to all training data, such that the values of $\epsilon_{(i,u)}$, $\epsilon_{(j,v)}$, and $\epsilon_{wt}$ are negligible.
In such a case, we find the predicted value of the one-hot encoding of a double-mutant to be:
\begin{equation*}
\begin{aligned}
F(\boldsymbol{s}_{(i,u),(j,v)}) \simeq \widehat{y}_{(i,u),(j,v)}
\end{aligned}
\end{equation*}
Therefore, the fitness value for a double-mutant, as predicted by a linear model that is unconstrained in fitting to a training set that includes the constituent single-mutations, is approximately equal to the additive fitness scores for that double-mutant.
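This result can be checked numerically on a toy example. The fitness values below are hypothetical, the alphabet is reduced to 3 letters for brevity, and NumPy's minimum-norm least squares stands in for an unconstrained linear fit; because there are fewer training samples than one-hot features, the fit interpolates the training data and the double-mutant prediction comes out additive.

```python
import numpy as np

AAS = "ABC"   # reduced 3-letter alphabet for the toy example
WT = "AAAA"   # toy 4-residue wild-type sequence

def one_hot(seq):
    x = np.zeros(len(seq) * len(AAS))
    for pos, aa in enumerate(seq):
        x[pos * len(AAS) + AAS.index(aa)] = 1.0
    return x

def mutate(seq, pos, aa):
    return seq[:pos] + aa + seq[pos + 1:]

# Training set: the wild type plus the two constituent single mutants.
train_seqs = [WT, mutate(WT, 1, "B"), mutate(WT, 3, "C")]
y = np.array([1.0, 0.8, 0.9])  # hypothetical fitness scores
X = np.stack([np.append(one_hot(s), 1.0) for s in train_seqs])  # bias column

# Under-determined system (3 samples, 13 features): the minimum-norm
# least-squares solution fits the training data exactly.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

double = mutate(mutate(WT, 1, "B"), 3, "C")
pred = np.append(one_hot(double), 1.0) @ w
additive = 0.8 + 0.9 - 1.0  # y_(i,u) + y_(j,v) - y_wt
# pred matches the additive score (0.7) to floating-point precision.
```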
\subsection*{Supplementary Figures}
{\label{553908}}\par\null\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/data-histo-distributions/data-histo-distributions}
\caption{{This is a caption
{\label{220521}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.98\columnwidth]{figures/alpha-optimization-fig/alpha-optimization-fig}
\caption{{This is a caption
{\label{411776}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.98\columnwidth]{figures/regression-pred-plots/regression-pred-plots}
\caption{{This is a caption
{\label{972712}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/improvement-over-random-sampling/improvement-over-random-sampling-NEW-ERROR-FUNC}
\caption{{This is a caption
{\label{678345}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.42\columnwidth]{figures/unirep-MS2-double-preds-compressed/unirep-MS2-double-preds-compressed}
\caption{{This is a caption
{\label{427767}}%
}}
\end{center}
\end{figure}
\selectlanguage{english}
\FloatBarrier
\section*{References}\sloppy
\phantomsection
\label{csl:1}1.Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. \& Church, G. M. {Low-N protein engineering with data-efficient deep learning}. \textit{bioRxiv} 2020.01.23.917682 (2020) doi:10.1101/2020.01.23.917682.
\phantomsection
\label{csl:2}2.Lee, D., Redfern, O. \& Orengo, C. {Predicting protein function from sequence and structure}. \textit{Nat Rev Mol Cell Biol} \textbf{8}, 995–1005 (2007).
\phantomsection
\label{csl:3}3.Dill, K. A. \& MacCallum, J. L. {The protein-folding problem, 50 years on}. \textit{Science} \textbf{338}, 1042–6 (2012).
\phantomsection
\label{csl:4}4.Romero, P. A. \& Arnold, F. H. {Exploring protein fitness landscapes by directed evolution}. \textit{Nat Rev Mol Cell Biol} \textbf{10}, 866–76 (2009).
\phantomsection
\label{csl:5}5.Wintrode, P. L. \& Arnold, F. H. {Temperature adaptation of enzymes: lessons from laboratory evolution}. \textit{Adv Protein Chem} \textbf{55}, 161–225 (2000).
\phantomsection
\label{csl:6}6.Maheshri, N., Koerber, J. T., Kaspar, B. K. \& Schaffer, D. V. {Directed evolution of adeno-associated virus yields enhanced gene delivery vectors}. \textit{Nat Biotechnol} \textbf{24}, 198–204 (2006).
\phantomsection
\label{csl:7}7.Smith, J. M. {Natural selection and the concept of a protein space}. \textit{Nature} \textbf{225}, 563–4 (1970).
\phantomsection
\label{csl:8}8.Ogbunugafor, C. B. {A Reflection on 50 Years of John Maynard Smith's Protein Space}. \textit{Genetics} \textbf{214}, 749–754 (2020).
\phantomsection
\label{csl:9}9.Arnold, F. H. {Combinatorial and computational challenges for biocatalyst design}. \textit{Nature} \textbf{409}, 253–7 (2001).
\phantomsection
\label{csl:10}10.Hopf, T. A. \textit{et al.} {Mutation effects predicted from sequence co-variation}. \textit{Nat Biotechnol} \textbf{35}, 128–135 (2017).
\phantomsection
\label{csl:11}11.Yang, K. K., Wu, Z. \& Arnold, F. H. {Machine-learning-guided directed evolution for protein engineering}. \textit{Nat Methods} \textbf{16}, 687–694 (2019).
\phantomsection
\label{csl:12}12.Beard, H., Cholleti, A., Pearlman, D., Sherman, W. \& Loving, K. A. {Applying physics-based scoring to calculate free energies of binding for single amino acid mutations in protein-protein complexes}. \textit{PLoS One} \textbf{8}, e82849 (2013).
\phantomsection
\label{csl:13}13.{Schr{\"o}dinger Release 2014-3: MacroModel}. Schr{\"o}dinger, LLC, New York, NY (2014).
\phantomsection
\label{csl:14}14.Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. \& Church, G. M. {Unified rational protein engineering with sequence-based deep representation learning}. \textit{Nat Methods} \textbf{16}, 1315–1322 (2019).
\phantomsection
\label{csl:15}15.Hartman, E. C. \textit{et al.} {Quantitative characterization of all single amino acid variants of a viral capsid-based drug delivery vehicle}. \textit{Nat Commun} \textbf{9}, 1385 (2018).
\phantomsection
\label{csl:16}16.Tournier, V. \textit{et al.} {An engineered PET depolymerase to break down and recycle plastic bottles}. \textit{Nature} \textbf{580}, 216–219 (2020).
\phantomsection
\label{csl:17}17.Austin, H. P. \textit{et al.} {Characterization and engineering of a plastic-degrading aromatic polyesterase}. \textit{Proc Natl Acad Sci U S A} \textbf{115}, E4350–E4357 (2018).
\phantomsection
\label{csl:18}18.Yang, K. K., Wu, Z., Bedbrook, C. N. \& Arnold, F. H. {Learned protein embeddings for machine learning}. \textit{Bioinformatics} \textbf{34}, 4138 (2018).
\phantomsection
\label{csl:19}19.Heinzinger, M. \textit{et al.} {Modeling aspects of the language of life through transfer-learning protein sequences}. \textit{BMC Bioinformatics} \textbf{20}, 723 (2019).
\phantomsection
\label{csl:20}20.Raimondi, D., Orlando, G., Vranken, W. F. \& Moreau, Y. {Exploring the limitations of biophysical propensity scales coupled with machine learning for protein sequence analysis}. \textit{Sci Rep} \textbf{9}, 16932 (2019).
\phantomsection
\label{csl:21}21.Buchan, D. W. A. \& Jones, D. T. {Learning a functional grammar of protein domains using natural language word embedding techniques}. \textit{Proteins} \textbf{88}, 616–624 (2020).
\phantomsection
\label{csl:22}22.Firnberg, E., Labonte, J. W., Gray, J. J. \& Ostermeier, M. {A comprehensive, high-resolution map of a gene's fitness landscape}. \textit{Mol Biol Evol} \textbf{31}, 1581–92 (2014).
\phantomsection
\label{csl:23}23.Cui, Y. \textit{et al.} {Computational redesign of PETase for plastic biodegradation by GRAPE strategy}. \textit{bioRxiv} (2019) doi:10.1101/787069.
\phantomsection
\label{csl:24}24.Peabody, D. S. {Subunit fusion confers tolerance to peptide insertions in a virus coat protein}. \textit{Archives of biochemistry and biophysics} \textbf{347}, 85–92 (1997).
\phantomsection
\label{csl:25}25.ElSohly, A. M. \textit{et al.} {Synthetically modified viral capsids as versatile carriers for use in antibody-based cell targeting}. \textit{Bioconjugate chemistry} \textbf{26}, 1590–1596 (2015).
\phantomsection
\label{csl:26}26.Kovacs, E. W. \textit{et al.} {Dual-surface-modified bacteriophage MS2 as an ideal scaffold for a viral capsid-based drug delivery system}. \textit{Bioconjugate chemistry} \textbf{18}, 1140–1147 (2007).
\phantomsection
\label{csl:27}27.Potter, S. C. \textit{et al.} {HMMER web server: 2018 update}. \textit{Nucleic Acids Res} \textbf{46}, W200–W204 (2018).
\phantomsection
\label{csl:28}28.Ma, E. J. \& Kummer, A. {Reimplementing Unirep in JAX}. (2020).
\phantomsection
\label{csl:29}29.Pokusaeva, V. \textit{et al.} {Experimental assay of a fitness landscape on a macroevolutionary scale}. \textit{bioRxiv} 222778 (2018).
\phantomsection
\label{csl:30}30.Sarkisyan, K. S. \textit{et al.} {Local fitness landscape of the green fluorescent protein}. \textit{Nature} \textbf{533}, 397 (2016).
\phantomsection
\label{csl:31}31.Vaswani, A. \textit{et al.} {Attention Is All You Need}. \textit{CoRR} \textbf{abs/1706.03762} (2017).
\end{document}