Introduction

The foundation of protein engineering lies in introducing mutations to modify protein function. The first attempts to rationally design mutants with improved function relied on chemical and biological domain knowledge, correlating human-interpretable patterns and motifs in secondary and tertiary folded protein structures with possible functional improvements \cite{Lee2007}. These traditional predictive pipelines require accurate characterization or prediction of the folded 3D structure, which is a difficult task \cite{Dill2012}. One of the grand challenges of modern computational protein engineering is therefore to skip structure entirely and instead develop accurate models that predict amino acid sequence-to-function relationships directly. Directed evolution is a hypothesis-free, structure-independent approach that has been used effectively to optimize the functionality of biomolecular systems by improving selected traits \cite{Romero2009,Wintrode2000,Maheshri2006}. However, the original directed evolution pipeline is both slow and resource intensive, requiring multiple high-throughput iterations of sequencing and functional characterization. Moreover, the desired protein function is often difficult to assay experimentally at high throughput, so a correlated proxy fitness function is used as the predictive label instead. The search space (fitness landscape) of possible amino acid residue mutations is both immense in size and sparse in protein sequences that actually fold \cite{Smith1970,Ogbunugafor2020}, let alone fold into variants with improved fitness. Even when restricted to the variant primary sequences that can result from just two co-expressed mutations, it is virtually impossible to test the fitness of all possible combinations through physical synthesis or even computational modeling \cite{Romero2009,Arnold2001,Hopf2017}. 
This necessitates the development of more sophisticated predictive models. Machine learning techniques can computationally guide the directed evolution process, reducing the number of experimental cycles required \cite{Yang2019}.
A key remaining bottleneck to further gains in speed and experimental cost is the large number (N) of initial wild-type mutants (the training set) needed to develop any predictive model. In a recent manuscript, the Church lab addressed this issue using data-efficient deep learning to achieve low-N protein engineering \cite{Biswas2020}. The approach described by Biswas et al holds promise as a powerful tool in protein engineering: given only a small library of synthesized and characterized protein variants and a single directed evolution cycle, one could engineer extensively modified proteins with optimized functionality. Several existing protein engineering procedures and software packages share a common Markov-chain Monte Carlo (MCMC) simulation regime with a Metropolis-Hastings acceptance criterion defined by a Boltzmann probability distribution, in which a temperature parameter governs the rate at which new mutations are adopted over current states throughout the mutagenesis trajectory. The relative favorability of proposed mutations is approximated by statistical mechanics based energy calculations \cite{Beard2013,release20143}. However, these approaches fall short in their ability to accurately predict the fitness of a given mutation, as their calculations are largely restricted to a protein's native folded state and cannot predict whether a mutation allows the protein to assemble through a folding pathway to reach that state in the first place. The in silico directed evolution protocol in Biswas et al makes two major improvements to existing schemes. The first is to exploit the often-ignored dark proteome by performing unsupervised pre-training on millions of uncharacterized protein sequences to learn a semantically rich, universal representation (UniRep) of protein sequences \cite{Alley2019}. 
This representation is further fine-tuned on a smaller, curated subset of evolutionarily relevant sequences to obtain a final "evotuned" representation (eUniRep). The second key insight is the use of a non-physical Boltzmann distribution, a function of eUniRep-dependent fitness predictions, in their MCMC in silico directed evolution. In this non-physical Boltzmann distribution, the numerator of the exponential's argument is the difference between a proposed mutant's predicted fitness and the current sequence's fitness. This scheme is expected to confer a significant benefit, as the eUniRep representations are trained to encode features capturing the natural laws that govern favorable folding and function in proteins. Optimization of this pipeline will be critical to the future of protein engineering for technologies ranging from nanomaterials and drug delivery \cite{Hartman2018} to enzymatic plastic degradation \cite{Tournier2020,Austin2018}. In this work we provide a complete, open-source, end-to-end re-implementation of the in silico UniRep protein engineering pipeline with improved computational efficiency, more detailed documentation, a cleaner API, and additional features to lower the barrier to entry for use of this pipeline as an engineering tool.
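As a concrete illustration, the Metropolis-Hastings acceptance rule with this fitness-based, non-physical Boltzmann distribution can be sketched as follows (a minimal sketch; the function and parameter names are ours, not from Biswas et al):

```python
import math
import random

def accept_mutation(current_fitness, proposed_fitness, temperature):
    """Metropolis-Hastings criterion under a non-physical Boltzmann
    distribution over predicted fitness (higher fitness is better)."""
    delta = proposed_fitness - current_fitness
    if delta >= 0:
        return True  # always accept predicted improvements
    # accept a predicted fitness loss with probability exp(delta / T)
    return random.random() < math.exp(delta / temperature)
```

The temperature parameter controls how readily fitness-decreasing mutations are accepted, and hence how exploratory the mutagenesis trajectory is.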
The cornerstone of the success of the low-N pipeline is the eUniRep representation. That better features lead to improved model performance is well established in machine learning, and learned representations are becoming a ubiquitously exploited way to improve model performance in protein bioinformatics as well \cite{Yang2018,Heinzinger2019,Raimondi2019,Buchan2020}. To further explore the success and necessity of the eUniRep feature space, we use our modified pipeline to compare the performance of various protein feature representations, with various supervised models, on three vastly different proteins, optimizing for three different functions: (1) Increasing the catalytic rate of TEM-1 β-lactamase (PDB: 1ZG4), using wild-type and mutant fitnesses from Firnberg et al \cite{Firnberg2014}, as a reproducibility study to replicate and extend the results and analysis presented in Biswas et al. (2) Increasing the melting temperature of Ideonella sakaiensis PETase (IsPETase, PDB: 6QGC), a recently discovered enzyme that can degrade polyethylene terephthalate (PET) and other aromatic polyesters. We chose PETase because plastic waste is an ever-growing problem that demands an immediate solution. PETase is revolutionary in its ability to degrade the ubiquitously used PET under mild conditions; however, its low melting temperature and inherent instability greatly limit its practical use on an industrial scale, hence the need to engineer more thermally stable mutants. Moreover, as a multitude of research groups are working on PETase engineering, this target allows us to directly compare the eUniRep in silico directed evolution approach to other recently published computational protein engineering pipelines, such as the GRAPE method \cite{bian2019}. Cui et al used greedy algorithms and k-means clustering to design an IsPETase mutant with a melting temperature elevated by 31 degrees above wild type, using a similarly low-N number of starting mutants. 
(3) Finally, we chose the icosahedral coat protein of the MS2 bacteriophage (PDB: 2MS2), due to the availability of extensive fitness data for mutated variants of this protein, allowing for in-depth epistasis analysis \cite{Hartman2018}, as well as the ability of virus-like nanoparticles to provide a stable scaffold for the development of nanomaterials, with a versatile range of new functionality and a remarkable tolerance to extensive synthetic modification \cite{peabody,elsohly,kovacs}.

Method

Existing Pipeline

We initially re-implemented the five in silico steps of the low-N protein engineering pipeline outlined in Biswas et al \cite{Biswas2020} for a target protein. This in silico method assumes prior wet-lab functional characterization of a small number (low N) of random mutants of the wild-type target protein:
  1. Learn a global, 1900-dimensional protein feature representation by training an mLSTM to do next-character prediction on > 20 million raw (unlabeled) amino acid sequences \cite{Alley2019}. This is the "UniRep" representation.
  2. Curate a smaller (order of 10,000) dataset of characterized protein sequences that are known to be evolutionarily related to the target protein.
  3. Learn features local to the target protein by fine-tuning the weights from step 1 on the dataset from step 2. This is the "evotuned UniRep" or "eUniRep" representation.
  4. Train a "simple" supervised top model (ensemble ridge regression with sparse refit) on the 1900-dimensional feature-space representation of a small number (<100) of characterized (labeled) wild-type mutants of the target protein (obtained by passing the protein sequences through the mLSTM trained in step 3). This top model provides an end-to-end sequence-to-function model for proteins local to the target protein.
  5. Run MCMC-based in silico directed evolution on the target protein, outputting mutated sequences that are predicted by the top model from step 4 to have the most improved function relative to wild type (>WT). Although the in silico pipeline is complete at this point, the top mutant sequences must then be functionally characterized to complete the full directed evolution cycle and confirm functional improvement in the engineered protein sequences.
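The top model of step 4 can be sketched with scikit-learn. The snippet below is a simplified stand-in: an ensemble of ridge regressors fit on subsamples of the low-N training set, with member predictions averaged; it omits the sparse refit, and all names and hyperparameters are our own illustrative choices rather than those of Biswas et al:

```python
import numpy as np
from sklearn.linear_model import Ridge

def train_top_model(X, y, n_members=10, alpha=1.0, subsample=0.8, seed=0):
    """Fit an ensemble of ridge regressors, each on a random subsample
    of the (eUniRep embedding, fitness) training pairs."""
    rng = np.random.default_rng(seed)
    n = len(y)
    members = []
    for _ in range(n_members):
        idx = rng.choice(n, size=max(2, int(subsample * n)), replace=False)
        members.append(Ridge(alpha=alpha).fit(X[idx], y[idx]))
    return members

def predict_fitness(members, X):
    # ensemble prediction: average over member models
    return np.mean([m.predict(X) for m in members], axis=0)
```

Here `X` would hold the 1900-dimensional mLSTM embeddings of the labeled mutants and `y` their measured fitness values.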
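The mutation-proposal step of the directed evolution trajectory in step 5 might look like the following (a simplified sketch assuming single random substitutions; the actual proposal scheme in Biswas et al may differ):

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_mutant(sequence, rng=random):
    """Propose one random amino acid substitution at a random position."""
    pos = rng.randrange(len(sequence))
    alternatives = [aa for aa in AMINO_ACIDS if aa != sequence[pos]]
    return sequence[:pos] + rng.choice(alternatives) + sequence[pos + 1:]
```

Each proposed mutant would then be embedded with the evotuned mLSTM, scored by the top model, and accepted or rejected via the Metropolis-Hastings criterion.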
As the training in step 1 was reported to take 3.5 weeks of wall-clock time on an AWS instance, we opted to use the weights provided by Alley et al \cite{Alley2019} as our starting point. Due to the minimal open-source code available, we largely re-implemented the remainder of the pipeline from scratch, following the method descriptions provided in Biswas et al. We identified five key areas for improvement with our re-implementation:

1. Curation of evolutionarily similar sequences for evotuning

The EBI JackHMMer web application \cite{Potter2018} used to curate the training set of sequences for evotuning cannot reliably handle a high volume of output hits for similar protein sequences. The alternative of installing and running HMMer locally requires downloading the entire ReferenceProteomes database, which was unrealistic given our local storage constraints. A potential workaround would be to script around a local HMMer install to search the ReferenceProteomes database from an online source. In our modified pipeline, we draft a formalized procedure for curating the local dataset using the Pfam and InterPro online databases and APIs (Supplementary 1).
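As an illustration, accessions for the proteins matched by a family can be pulled programmatically. The sketch below assumes the InterPro REST API returns paginated JSON with a `next` link and per-result `metadata` records; the exact endpoint layout and field names are assumptions and should be verified against the current InterPro API documentation:

```python
import requests

INTERPRO_API = "https://www.ebi.ac.uk/interpro/api"  # assumed base URL

def extract_accessions(payload):
    """Pull UniProt accessions out of one page of (assumed) JSON results."""
    return [r["metadata"]["accession"] for r in payload.get("results", [])]

def fetch_family_accessions(pfam_id, page_size=200):
    """Page through UniProt proteins matched by a Pfam family."""
    url = f"{INTERPRO_API}/protein/UniProt/entry/pfam/{pfam_id}/?page_size={page_size}"
    accessions = []
    while url:
        payload = requests.get(url, timeout=60).json()
        accessions.extend(extract_accessions(payload))
        url = payload.get("next")  # pagination link, or None when done
    return accessions
```

The returned accessions can then be mapped to full sequences and filtered (e.g. by length or identity to the target) before evotuning.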