The research described herein focuses on reproducing the methodology described by Biswas et al. and applying the principles of their approach to three natural protein systems. In our exploration of these methods, we have found several key components in the reported pipeline that have a pronounced influence on the ability to predict the effect of mutagenesis on protein function. Additionally, our research has led us to identify factors within this framework that limit predictive accuracy, as well as ways to improve such performance through modified approaches.
This work is an analysis of how one can use unsupervised deep learning to discover features in proteins, and how these features can assist in protein engineering and biohybrid materials design.
Optimization of this pipeline will be critical to the future of protein engineering for biomedical technologies and biohybrid materials design [LINK TO CAPSID AND GRAPE PAPERS]...
- One of the model systems explored in this research is the reengineering of the icosahedral coat protein of the MS2 bacteriophage, chosen for the availability of extensive fitness data for mutated variants of this protein \cite{hartman} and for the ability of virus-like nanoparticles to provide a stable scaffold for the development of nanomaterials, with a versatile range of new functionality and a remarkable tolerance to extensive synthetic modification \cite{peabody,elsohly,kovacs}.
Papers to read and add to intro / cite:
\cite{Yang2018}
\cite{Yang2019}
Method
Existing Pipeline
We initially re-implemented the five in-silico steps of the low-N protein engineering pipeline outlined by Biswas et al. \cite{Biswas2020} for a target protein. This in-silico method assumes prior wet-lab functional characterization of a small number (low-N) of random mutants of the wild-type target protein:
- Learn a global, 1900-dimensional protein feature representation by training an mLSTM to perform next-character prediction on >20 million raw (unlabeled) amino acid sequences \cite{Alley2019}. This is the "UniRep" representation.
- Curate a smaller dataset (on the order of 10,000 sequences) of proteins known to be evolutionarily related to the target protein.
- Learn features local to the target protein by fine-tuning the weights from step 1 on the dataset from step 2. This is the "evotuned UniRep" or "eUniRep" representation.
- Train a "simple" supervised top model (enseble ridge regression with sparse refit) on the 1900-dimensional feature space representation of a small number (<100) of characterized (labeled) wild-type mutants of the target protein (obtained by passing the protein sequences through the mLSTM trained in step 3). This top model provides an end-to-end sequence-to-function model for proteins local to the target protein.
- Perform Markov chain Monte Carlo (MCMC) based in-silico directed evolution on the target protein, outputting mutated sequences that the top model from step 4 predicts to have the most improved function relative to wild type (>WT); a sketch of the sampling loop also follows this list. Although this completes the in-silico pipeline, the top mutant sequences must then be functionally characterized to close the full directed evolution cycle and confirm functional improvement in the engineered protein sequences.
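As a concrete illustration of step 4, the following is a minimal sketch of how an ensemble ridge top model with a sparse refit could be built on top of fixed eUniRep features. Here X is an (N, 1900) array obtained by embedding the labeled mutant sequences and y holds their measured fitness values; the ensemble size, regularization strengths, and the interpretation of "sparse refit" as L1-based feature selection followed by a ridge refit are our assumptions rather than the exact procedure of Biswas et al.

\begin{verbatim}
import numpy as np
from sklearn.linear_model import Ridge, Lasso


def fit_top_model(X, y, n_members=10, alpha=1.0, l1_alpha=1e-3, seed=0):
    """Bootstrapped ensemble of ridge regressors on eUniRep features,
    followed by a 'sparse refit': an L1 fit selects features and the
    ensemble is refit on that reduced feature set (our interpretation)."""
    rng = np.random.default_rng(seed)

    # Feature selection via an L1-penalized fit on the full training set.
    lasso = Lasso(alpha=l1_alpha, max_iter=50_000).fit(X, y)
    keep = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)
    if keep.size == 0:            # fall back to using every feature
        keep = np.arange(X.shape[1])

    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample
        members.append(Ridge(alpha=alpha).fit(X[idx][:, keep], y[idx]))
    return members, keep


def predict_fitness(members, keep, X):
    """Ensemble prediction: mean over members (the spread across members
    can also be used as an uncertainty estimate)."""
    preds = np.stack([m.predict(X[:, keep]) for m in members])
    return preds.mean(axis=0)
\end{verbatim}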
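Step 5 can likewise be illustrated with a simple Metropolis-style sampler over sequence space that uses the top model's predicted fitness as its objective. The single-substitution proposal scheme, temperature, step count, and cap on the number of mutations away from wild type below are our assumptions, not the exact sampler of the low-N paper.

\begin{verbatim}
import numpy as np

AA = list("ACDEFGHIKLMNPQRSTVWY")  # 20 canonical amino acids


def mcmc_directed_evolution(wt_seq, predict_fn, n_steps=3000,
                            temperature=0.1, max_mutations=15, seed=0):
    """Metropolis-style in-silico directed evolution. `predict_fn` maps a
    sequence string to a scalar predicted fitness, e.g. the top model
    applied to the eUniRep embedding of that sequence."""
    rng = np.random.default_rng(seed)
    current = list(wt_seq)
    current_fit = predict_fn("".join(current))
    visited = [("".join(current), current_fit)]

    for _ in range(n_steps):
        proposal = current.copy()
        pos = rng.integers(len(proposal))           # random site
        proposal[pos] = AA[rng.integers(len(AA))]   # random substitution
        if sum(a != b for a, b in zip(proposal, wt_seq)) > max_mutations:
            continue                                # stay near wild type
        prop_fit = predict_fn("".join(proposal))
        # Always accept uphill moves; accept downhill moves with
        # Boltzmann probability exp(delta_fitness / temperature).
        if (prop_fit >= current_fit or
                rng.random() < np.exp((prop_fit - current_fit) / temperature)):
            current, current_fit = proposal, prop_fit
            visited.append(("".join(current), current_fit))

    # Rank visited sequences by predicted fitness, best first.
    return sorted(visited, key=lambda t: -t[1])
\end{verbatim}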
As the training in step 1 was reported to take 3.5 weeks of wall-clock time on an AWS instance, we opted to use the weights provided by Alley et al. \cite{Alley2019} as our starting point. Because the authors of the low-N eUniRep pipeline released minimal open-source code, we largely re-implemented the remainder of the pipeline from scratch, following to the best of our abilities the often vague or convoluted method descriptions in Biswas et al. We faced three key issues with our re-implementation:
- The EBI JackHMMer web application \cite{Potter2018} was unable to handle the volume of outputs, and we had insufficient local hard drive space to download the Reference Proteomes database and perform the search locally.
- The mLSTM model implemented in TensorFlow used too much memory and was unable to process sequences longer than 275 amino acids with 16 GB of RAM.
- The sparse refit was not described in enough detail for us to implement, nor did the descriptions in the low-N paper make the procedure sound necessary; no figure quantitatively evaluates the performance of different top models. Altogether, we found the supervised top model analysis in the low-N paper lacking.
Implemented Improvements
To address these three key issues and to improve end-user usability, we made the following changes to the pipeline:
- Drafted a formalized procedure for curating the local dataset used for evolutionary fine-tuning (see the jackhmmer-based sketch after this list).
- Replaced the TensorFlow implementation with a JAX implementation, gaining improvements in both memory footprint and speed (a minimal sketch of the JAX formulation also follows this list).
- Implemented a thorough top model evaluation script and settled on the best top model as...
- Added a computational evaluation of thermostability fitness. Note that fitness can take many forms; thermostability is just one of them, and one for which computational verification happens to be possible.
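The curation procedure is built around a local jackhmmer (HMMER suite) search seeded with the target protein; the sketch below shows the kind of invocation we mean. The database path, iteration count, and E-value threshold are placeholders/assumptions, and the resulting hits are subsequently deduplicated, length-filtered, and subsampled to roughly 10,000 sequences for evotuning.

\begin{verbatim}
import subprocess

QUERY = "target_protein.fasta"   # wild-type target sequence
SEQDB = "uniref50.fasta"         # placeholder: any large local sequence database

# Iterative profile search; flag values are illustrative, not prescriptive.
subprocess.run(
    [
        "jackhmmer",
        "-N", "5",                 # at most 5 search iterations
        "--incE", "1e-3",          # inclusion E-value threshold
        "-A", "evotune_hits.sto",  # aligned hits (Stockholm format)
        "--tblout", "hits.tbl",    # per-target tabular summary
        QUERY,
        SEQDB,
    ],
    check=True,
)
\end{verbatim}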
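The memory and speed gains of the JAX implementation come largely from expressing the mLSTM recurrence with jax.lax.scan under jit rather than unrolling the sequence. The sketch below shows the general shape of such an implementation; the parameter names, fused-gate layout, and average-hidden-state summary are simplifications on our part and do not reproduce the exact UniRep parameterization.

\begin{verbatim}
import jax
import jax.numpy as jnp


def mlstm_step(params, carry, x_t):
    """One mLSTM step (Krause-style multiplicative LSTM), simplified."""
    h, c = carry
    m = (x_t @ params["wmx"]) * (h @ params["wmh"])           # multiplicative term
    z = x_t @ params["wx"] + m @ params["wm"] + params["b"]   # fused gate pre-activations
    i, f, o, u = jnp.split(z, 4, axis=-1)
    c = jax.nn.sigmoid(f) * c + jax.nn.sigmoid(i) * jnp.tanh(u)
    h = jax.nn.sigmoid(o) * jnp.tanh(c)
    return (h, c), h


@jax.jit
def unirep_embed(params, x_seq):
    """Run a one-hot encoded sequence (length L x vocab size) through the
    mLSTM with lax.scan and average the hidden states into a fixed-length
    (e.g. 1900-d) representation."""
    hidden = params["b"].shape[0] // 4
    init = (jnp.zeros(hidden), jnp.zeros(hidden))
    step = lambda carry, x_t: mlstm_step(params, carry, x_t)
    _, hs = jax.lax.scan(step, init, x_seq)
    return hs.mean(axis=0)   # "average hidden state" summary
\end{verbatim}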
Additional Analysis
Biswas et al. verify the ... by generating these plots ...; they show success on two datasets, avGFP and TEM-1 beta-lactamase...
We wanted to more thoroughly verify both the viability and the necessity of eUniRep for in-silico low-N protein engineering. We did this by replicating and more deeply analyzing the TEM-1 beta-lactamase study and by applying the pipeline to two new target proteins, both using our improved pipeline. We also performed an epistasis analysis by comparing top models trained on single versus double mutants of the MS2 coat protein (a sketch of this analysis follows).
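As a sketch of how the epistasis comparison can be set up (with embed_fn, fit_fn, and predict_fn as hypothetical stand-ins for the eUniRep embedder and top model described above), one can train on single mutants only and measure rank correlation on double mutants:

\begin{verbatim}
import numpy as np
from scipy.stats import spearmanr


def n_mutations(seq, wt_seq):
    return sum(a != b for a, b in zip(seq, wt_seq))


def single_to_double_generalization(seqs, y, wt_seq, embed_fn, fit_fn, predict_fn):
    """Train the top model on single mutants only and evaluate on double
    mutants, probing how much pairwise epistasis the representation captures."""
    y = np.asarray(y)
    counts = np.array([n_mutations(s, wt_seq) for s in seqs])
    X = embed_fn(seqs)                       # (N, 1900) eUniRep features
    singles, doubles = counts == 1, counts == 2
    model = fit_fn(X[singles], y[singles])
    rho, _ = spearmanr(y[doubles], predict_fn(model, X[doubles]))
    return rho
\end{verbatim}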
Results
Comparing our improved pipeline