Papers to read and add to intro / cite:
\cite{Yang2018}
\cite{Yang2019}

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6858301/
https://onlinelibrary.wiley.com/doi/abs/10.1002/prot.25842

Method

Existing Pipeline

We first re-implemented the five in silico steps of the low-N protein engineering pipeline outlined by Biswas et al. \cite{Biswas2020} for a target protein. This in silico method assumes prior wet-lab functional characterization of a small number (low-N) of random mutants of the wild-type target protein:
  1. Learn a global, 1900-dimensional protein feature representation by training an mLSTM on next-character prediction over more than 20 million raw (unlabeled) amino acid sequences \cite{Alley2019}. This is the "UniRep" representation.
  2. Curate a smaller (order of 10,000) dataset of characterized protein sequences that are known to be evolutionarily related to the target protein.
  3. Learn features local to the target protein by fine-tuning the weights from step 1 on the dataset from step 2. This is the "evotuned UniRep" or "eUniRep" representation.
  4. Train a "simple" supervised top model (ensemble ridge regression with sparse refit) on the 1900-dimensional feature representations of a small number (fewer than 100) of characterized (labeled) mutants of the wild-type target protein, obtained by passing the protein sequences through the mLSTM trained in step 3. This top model provides an end-to-end sequence-to-function model for proteins local to the target protein.
  5. Perform Markov Chain Monte Carlo (MCMC) based in silico directed evolution on the target protein, outputting mutated sequences that the top model from step 4 predicts to have the most improved function relative to wild-type ($>$WT). Although this completes the in silico pipeline, the top mutant sequences must then be functionally characterized to complete the full directed evolution cycle and confirm functional improvement in the engineered protein sequences.
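Step 5 can be illustrated with a minimal Metropolis-style sketch. This is not the implementation from Biswas et al. or from our pipeline: the function names, the `temperature` and `trust_radius` parameters, the example sequence, and the toy fitness function (standing in for the trained top model) are all hypothetical and chosen only for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def propose(seq, rng):
    """Return a copy of seq with one random position mutated to a different residue."""
    pos = rng.randrange(len(seq))
    choices = [aa for aa in AMINO_ACIDS if aa != seq[pos]]
    return seq[:pos] + rng.choice(choices) + seq[pos + 1:]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mcmc_evolve(wild_type, predict_fitness, n_steps=500,
                temperature=0.1, trust_radius=5, seed=0):
    """Metropolis sampling over sequences, guided by a fitness predictor.

    trust_radius (an assumption of this sketch) caps the number of
    mutations relative to wild type.
    """
    rng = random.Random(seed)
    current, current_fit = wild_type, predict_fitness(wild_type)
    best, best_fit = current, current_fit
    for _ in range(n_steps):
        candidate = propose(current, rng)
        if hamming(candidate, wild_type) > trust_radius:
            continue  # reject proposals too far from wild type
        cand_fit = predict_fitness(candidate)
        delta = cand_fit - current_fit
        # Always accept improvements; accept worse candidates with
        # probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_fit = candidate, cand_fit
            if current_fit > best_fit:
                best, best_fit = current, current_fit
    return best, best_fit


def toy_fitness(seq):
    """Hypothetical stand-in for the top model: counts tryptophan residues."""
    return float(seq.count("W"))


wild_type = "MKTAYIAKQR"  # arbitrary example sequence, not a real target
best, best_fit = mcmc_evolve(wild_type, toy_fitness)
print(best, best_fit)
```

In a real run, `predict_fitness` would pass the candidate sequence through the eUniRep mLSTM and the top model from step 4 rather than call a toy function.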
As the training in step 1 was reported to take 3.5 weeks of wall-clock time on an AWS instance, we used the weights provided by Alley et al. \cite{Alley2019} as our starting point. Because the authors of the low-N eUniRep pipeline released little open-source code, we re-implemented most of the remaining pipeline from scratch, following the method descriptions in Biswas et al. as closely as we could where they were vague or convoluted. We faced three key issues with our re-implementation:
  1. The EBI JackHMMer web application \cite{Potter2018} was unable to handle the volume of outputs, and we had insufficient local hard drive space to download the ReferenceProteomes database and perform the search locally.
  2. The mLSTM model implemented in TensorFlow used too much memory and was unable to process sequences longer than 275 amino acids with 16 GB of RAM.
  3. We could not determine how to implement the sparse refit, and the descriptions in the low-N paper did not make the procedure sound necessary. No figure showed or quantitatively evaluated the performance of different top models, and we found the supervised top model analysis in the low-N paper altogether lacking.

Implemented Improvements

To address these three key issues and to improve end-user usability, we made the following improvements to the pipeline:
  1. Drafted a formalized procedure for curating the local dataset used for evolutionary fine-tuning.
  2. Replaced the TensorFlow implementation with a JAX implementation, which reduced the memory required and improved speed.
  3. Implemented a thorough top model evaluation script and settled on the best top model as...
  4. Added computational evaluation of thermostability as a fitness measure. Fitness can refer to a wide variety of properties; thermostability is one for which computational verification happens to be possible.
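The kind of comparison a top model evaluation performs can be sketched as follows. This is a dependency-free illustration rather than our evaluation script: it fits an ensemble of ridge regressions on bootstrap resamples (the bootstrap is an assumption made here; the sparse refit is omitted) and compares held-out error against a mean-label baseline. The synthetic 3-dimensional features stand in for the 1900-dimensional eUniRep representation, and all function names are hypothetical.

```python
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ridge(X, y, alpha):
    """Closed-form ridge without intercept: (X^T X + alpha I) w = X^T y."""
    d = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    return solve_linear(gram, rhs)


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_ensemble(X, y, alpha=1.0, n_models=20, seed=0):
    """Fit one ridge model per bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_ridge([X[i] for i in idx], [y[i] for i in idx], alpha))
    return models


def predict_ensemble(models, x):
    """Average the predictions of all ensemble members."""
    return sum(predict(w, x) for w in models) / len(models)


def mse(models, X, y):
    return sum((predict_ensemble(models, xi) - yi) ** 2
               for xi, yi in zip(X, y)) / len(y)


# Synthetic data: linear ground truth plus small Gaussian noise.
rng = random.Random(1)
true_w = [1.0, -2.0, 0.5]


def make_data(n):
    X = [[rng.gauss(0, 1) for _ in true_w] for _ in range(n)]
    y = [predict(true_w, x) + rng.gauss(0, 0.1) for x in X]
    return X, y


X_train, y_train = make_data(80)
X_test, y_test = make_data(20)
models = fit_ensemble(X_train, y_train)
mean_label = sum(y_train) / len(y_train)
baseline_mse = sum((mean_label - yi) ** 2 for yi in y_test) / len(y_test)
ensemble_mse = mse(models, X_test, y_test)
print(f"ensemble MSE: {ensemble_mse:.4f}, mean-baseline MSE: {baseline_mse:.4f}")
```

Comparing each candidate top model against a trivial baseline on held-out mutants is one simple way to make the choice of top model quantitative.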

Additional Analysis

Biswas et al. verify the ... by generating these plots ... they show success on two datasets, avGFP and TEM-1 beta-lactamase...
We wanted to verify more thoroughly both the viability and the necessity of eUniRep for in silico low-N protein engineering. We did this by replicating and analyzing in greater depth the TEM-1 beta-lactamase study and by applying the pipeline to two new target proteins, in both cases using our improved pipeline. We also analyzed epistasis by comparing training on single and double mutants of the MS2 capsid protein.

Results

Comparing our improved pipeline