- Biswas et al only report performance comparisons for four top model types: Lasso-Lars, ridge, ridge with sparse refit, and ensembled ridge with sparse refit, with minimal information on the hyperparameters used and how well the top models fit. Moreover, top model performance comparisons are only given in the form of retrospective experiments on avGFP, from which it is unconvincing that (A) pre-training on 20 million sequences is required, since Local eUniRep appears to show comparable performance, and (B) ensembled ridge regression with sparse refit is required, since it demonstrates only minor improvement over basic ridge regression in fold improvement over random sampling, while requiring a longer training time and a larger number of hyperparameters to tune.
To address these three key issues, as well as to add some quality-of-life improvements for end users, we made the following modifications to the pipeline:
- Collaborated to complete an open-source JAX re-implementation of the mLSTM, used to obtain UniRep representations of sequences and, through evotuning, eUniRep representations. JAX was chosen to achieve a 100x speedup in passing new protein sequences through the trained mLSTM to obtain the 1900-dimensional UniRep/eUniRep representations. Technical implementation details and performance comparisons can be found in Ma et al \cite{kummer2020}. The JAX implementation also had the crucial added benefit of requiring less memory to run, affording the ability to evotune on longer protein sequences (a minimal usage sketch is given after this list).
- Motivated by the lack of reported results comparing and evaluating top model performance, we implemented a thorough, general top model evaluation script that can be run on each new target protein to determine the best top model and perform hyperparameter tuning (an illustrative sketch is given after this list).
- To quantify the performance of a predictive model in this specific context of application, it is important to account for its ultimate use case: the Metropolis-Hastings acceptance criterion (sketched after this list). In the MCMC simulations that emulate directed evolution, the probability that a proposed mutant sequence is accepted as the new sequence at each iteration depends on whether the proposed sequence has a greater predicted fitness score than the current sequence. Thus, an appropriate top model should not necessarily prioritize predictions that minimize the average deviation from the experimental fitness scores, but rather prioritize accurately predicting the magnitude of each fitness score relative to the other mutant sequences under consideration. In Biswas et al, the only top model evaluation performed is on characterized avGFP mutants in retrospective experiments.
- To optimize the parameters for the task of fitness score ranking, we developed a new type of error function (Supplementary Discussion 1). We use this top model evaluation metric in parallel with a scoring scheme from Biswas et al, which uses a given top model to produce a ranked sorting of a holdout set of mutant sequences. The score is the number of mutants in the top 10% of the ranking whose fitness exceeds that of wild type, expressed as a ratio over the average count obtained from random 10% samplings (a sketch of this scoring scheme is given after this list).
- Our entire pipeline is documented, open source, and customizable.
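As a usage sketch for the JAX re-implementation described above, the snippet below shows how 1900-dimensional representations can be obtained and how evotuning might be invoked. It assumes the `jax-unirep` package API; the exact keyword arguments (e.g., `n_epochs`) are assumptions and may differ between versions, so the package documentation should be treated as authoritative.

```python
# Sketch: UniRep representations and evotuning with jax-unirep.
# Sequences here are short placeholders; real inputs are full protein sequences.
from jax_unirep import get_reps, fit

sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA",
]

# UniRep representations from the globally pre-trained mLSTM:
# h_avg is one 1900-dimensional vector per sequence.
h_avg, h_final, c_final = get_reps(sequences)

# Evotuning: fine-tune the pre-trained weights on local homologs to obtain
# eUniRep parameters (keyword arguments here are assumptions, not verified).
evotuned_params = fit(sequences, n_epochs=20)

# The evotuned parameters can then be passed back to get_reps to produce
# eUniRep representations; see the jax-unirep docs for the exact call.
```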
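For the top model evaluation script described above, the following is an illustrative sketch (not the exact script) of how candidate top models can be compared and tuned by cross-validation on a new target protein's assay data using scikit-learn; the model grid and function name are placeholders.

```python
# Sketch: cross-validated comparison of candidate top models on sequence features.
import numpy as np
from sklearn.linear_model import Ridge, LassoLars
from sklearn.model_selection import GridSearchCV, KFold

def select_top_model(X, y, n_splits=5, seed=0):
    """X: (n_mutants, 1900) sequence representations; y: measured fitness."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    candidates = {
        "ridge": GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)}, cv=cv),
        "lasso_lars": GridSearchCV(LassoLars(), {"alpha": np.logspace(-4, 0, 9)}, cv=cv),
    }
    results = {}
    for name, search in candidates.items():
        search.fit(X, y)
        results[name] = (search.best_score_, search.best_params_, search.best_estimator_)
    # Default scoring for regressors is R^2; a ranking-oriented metric can be
    # substituted via the `scoring` argument of GridSearchCV.
    best_name = max(results, key=lambda name: results[name][0])
    return best_name, results
```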
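The acceptance step referenced in the discussion of model evaluation can be written compactly. The sketch below assumes a standard Metropolis-Hastings rule in which a proposal with higher predicted fitness is always accepted and a worse proposal is accepted with a probability that decays with the predicted fitness drop; the temperature `T` and function name are illustrative assumptions, not the exact values used in the pipeline.

```python
# Sketch: Metropolis-Hastings acceptance of a proposed mutant, driven by the
# top model's predicted fitness scores (temperature value is illustrative).
import numpy as np

def mh_accept(fitness_current, fitness_proposed, T=0.1, rng=None):
    rng = rng or np.random.default_rng()
    if fitness_proposed >= fitness_current:
        return True  # always accept an improvement
    # otherwise accept with probability exp((f_proposed - f_current) / T)
    return rng.random() < np.exp((fitness_proposed - fitness_current) / T)
```

Because only the relative ordering and spacing of the predicted scores enter this rule, ranking accuracy matters more than absolute closeness to the experimental values.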
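Finally, the Biswas et al-style scoring scheme described above can be sketched as follows: rank a holdout set by predicted fitness, count how many of the top 10% have measured fitness above wild type, and normalize by the average count from random 10% samplings. Function and argument names are placeholders.

```python
# Sketch: top-10% enrichment score over random sampling.
import numpy as np

def top10_enrichment(y_pred, y_true, wt_fitness, n_random=1000, rng=None):
    rng = rng or np.random.default_rng()
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    n_top = max(1, int(0.1 * len(y_true)))

    # mutants ranked into the top 10% by the top model
    top_idx = np.argsort(y_pred)[::-1][:n_top]
    hits = np.sum(y_true[top_idx] > wt_fitness)

    # average hits from random 10% samplings of the holdout set
    random_hits = np.mean([
        np.sum(y_true[rng.choice(len(y_true), n_top, replace=False)] > wt_fitness)
        for _ in range(n_random)
    ])
    return hits / random_hits if random_hits > 0 else float("inf")
```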