Future Work
There are several avenues for further exploration and pipeline improvement. In this report we focused primarily on the top model and directed evolution, due to the computational and time constraints involved in training the mLSTM for generating UniRep and evotuned weights. As a result, there is substantial further work to be done on the pre-training model. One curious finding from our few evotuning runs (Fig 1a) is that evotuning loss does not correlate with top model performance, as measured by our custom scoring function. To further probe this relationship and determine the optimal stopping point for mLSTM training, directed evolution performance should be validated across combinations of learning rates and numbers of training epochs.

Moreover, given the uncertainty about why the TEM-1 evotuned weights outperform the IsPETase evotuned weights on the IsPETase dataset, it could be helpful to retrain the UniRep weights from scratch and compare their performance against the Church Lab's provided UniRep weights, to rule out a faulty JAX re-implementation as a potential error source. One key difference between the JAX and TensorFlow (TF) implementations of the mLSTM is that the TF model pads sequences of different lengths and trains in fixed batch sizes, whereas the JAX implementation does not pad sequences, instead training on variable-size batches of fixed sequence length (to achieve a computational speedup). In practice, fixed batch-size training often yields better convergence; modifying our implementation to train on fixed batch sizes (while still maintaining the computational speedup) might further improve top model performance. Finally, as is happening across bioinformatics and natural language processing, mLSTMs are being replaced by transformers \cite{Vaswani2017}; this is something that could also be tried here.
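The validation sweep over learning rates and epoch counts described above could be organized as a simple grid search. The sketch below is illustrative only: `evotune` and `score_directed_evolution` are hypothetical stand-ins for the real pipeline functions, not the actual API, and the placeholder scoring logic exists purely to make the example self-contained.

```python
from itertools import product

def evotune(learning_rate, epochs):
    # Hypothetical stand-in: the real function would fine-tune the mLSTM
    # and return evotuned weights. Here we just return the configuration.
    return {"lr": learning_rate, "epochs": epochs}

def score_directed_evolution(weights):
    # Hypothetical stand-in for running directed evolution with the given
    # weights and applying the report's custom scoring function.
    return -abs(weights["lr"] - 1e-4) - 0.001 * weights["epochs"]

# Grid of candidate hyperparameters; the actual ranges would be chosen
# from the evotuning runs in Fig 1a.
grid = product([1e-3, 1e-4, 1e-5], [10, 50, 100])
results = {(lr, ep): score_directed_evolution(evotune(lr, ep))
           for lr, ep in grid}
best = max(results, key=results.get)
```

Because evotuning loss does not track top model performance, the point is that the selection criterion here is downstream directed evolution score, not training loss.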
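The TF-style padding strategy discussed above can be sketched in a few lines. This is a minimal illustration, not the actual UniRep preprocessing code; the pad index and integer encoding are assumptions, and a real implementation would also need the mask applied inside the loss so padded positions do not contribute to gradients.

```python
import numpy as np

# Assumed pad token index; the real vocabulary and pad index depend on
# the UniRep implementation being used.
PAD = 0

def pad_batch(seqs, pad_value=PAD):
    """Pad integer-encoded sequences to the length of the longest one,
    so they can be stacked into a single fixed-size batch (TF-style)."""
    max_len = max(len(s) for s in seqs)
    batch = np.full((len(seqs), max_len), pad_value, dtype=np.int32)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        batch[i, : len(s)] = s
        mask[i, : len(s)] = True  # True at real residues, False at padding
    return batch, mask

# Three sequences of different lengths become one (3, 4) batch.
seqs = [[4, 7, 1], [2, 9], [5, 5, 5, 5]]
batch, mask = pad_batch(seqs)
```

By contrast, the current JAX implementation buckets sequences by length so that every batch has a uniform shape without padding; the trade-off named in the text is between that speedup and the convergence behavior of fixed batch sizes.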
Ultimately, however, the most important future work is experimental characterization of our directed evolution outputs. We believe the biggest contribution of this work to be our complete, open-source, generalized, end-to-end re-implemented pipeline, and we hope researchers and engineers will find it helpful in their own work. Perhaps someone reading this will have the lab setup to help with the experimental characterization, either on our three protein case studies or on a novel one of their own!
Software Repository