3. Rigorous top model selection and fine-tuning

Biswas et al. report performance comparisons for only four top model types: Lasso-LARS, ridge, ridge with sparse refit, and ensembled ridge with sparse refit, with minimal information on which hyperparameters were used and how well the top models fit. Moreover, top model performance comparisons are given only as retrospective experiments on avGFP (green fluorescent protein from Aequorea victoria). From these results alone, it is unconvincing that (A) pre-training on 20 million sequences is required, since Local eUniRep appears to show comparable performance, and (B) that ensembled ridge regression with sparse refit is required, since it demonstrates only a minor improvement over basic ridge regression in fold improvement over random sampling, while requiring longer training time and a larger number of hyperparameters to tune. Motivated by the lack of reported results comparing and evaluating top model performance, we implemented a thorough, general top model evaluation script that can be run on each new target protein to determine the best top model and hyperparameters for a given dataset.
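
To make the evaluation procedure concrete, the sketch below shows the kind of cross-validated comparison such a script performs, assuming precomputed sequence representations X and measured fitness values y. The candidate models and hyperparameter grids shown are illustrative choices, not the exact set used in our script or in Biswas et al.

```python
# Minimal sketch: cross-validated comparison of candidate top models on
# precomputed sequence representations X (n_samples x n_features) and
# measured fitness values y. Candidate list and grids are illustrative.
import numpy as np
from sklearn.linear_model import Ridge, LassoLars
from sklearn.model_selection import GridSearchCV, KFold

def select_top_model(X, y, seed=0):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    candidates = {
        "ridge": (Ridge(), {"alpha": np.logspace(-3, 3, 13)}),
        "lasso_lars": (LassoLars(), {"alpha": np.logspace(-4, 0, 9)}),
    }
    results = {}
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(model, grid, cv=cv,
                              scoring="neg_mean_squared_error")
        search.fit(X, y)
        results[name] = (search.best_score_, search.best_params_)
    # Return the model family and hyperparameters with the best CV score.
    best = max(results, key=lambda name: results[name][0])
    return best, results[best]
```

In practice, the generic mean-squared-error criterion shown here would be replaced by a scoring function suited to the ranking task, such as the one introduced in the next section.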

4. New scoring function to evaluate top model performance

To quantify the performance of a predictive model in this specific context of application, it is important to consider the factors that matter in its ultimate use case: the Metropolis-Hastings acceptance criterion. In the MCMC simulations that emulate directed evolution, the probability that a proposed mutant sequence is accepted as the new sequence at each iteration depends on whether the proposed sequence has a greater predicted fitness score than the current sequence. An appropriate top model therefore should not necessarily prioritize predictions that minimize the average deviation from the experimental fitness scores, but rather prioritize accurately predicting the relative magnitude of each fitness score with respect to the values of the other mutant sequences under consideration.

In Biswas et al., the only purely in silico top model characterization uses the following scoring metric: a given top model is used to rank a held-out set of mutant sequences, and the score is the number of mutants in the top 10% of that ranking whose measured fitness exceeds wild type, expressed as a ratio over the average count obtained from random 10% samplings. The shortcoming of this scoring scheme is its lack of generalizability. The choice to look at the top 10% is arbitrary, and the metric requires the wild-type fitness value as input, which is specific to each protein and function. Furthermore, scores are not comparable across datasets, even for the same protein and function, because they depend strongly on the fitness distribution of the dataset. To address these issues and to optimize parameters for the task of fitness score ranking, we developed a new scoring function (Supplementary 2).
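
To illustrate why only the relative ordering of predicted fitness values matters, the sketch below shows a Metropolis-style acceptance step of the kind used in such simulations. The exponential acceptance form and the temperature parameter are assumptions for illustration; the exact rule and schedule used in Biswas et al. may differ.

```python
# Minimal sketch of a Metropolis-style acceptance step for in silico
# directed evolution. The exponential form and temperature are assumptions
# for illustration; what drives acceptance is how the proposed sequence's
# predicted fitness compares with the current sequence's, not its absolute
# closeness to any experimental value.
import math
import random

def accept_proposal(fitness_current, fitness_proposed, temperature=0.01):
    if fitness_proposed >= fitness_current:
        return True  # proposals predicted to be at least as fit are accepted
    # Worse proposals are accepted with probability decaying in the deficit.
    p = math.exp((fitness_proposed - fitness_current) / temperature)
    return random.random() < p
```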
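For reference, the sketch below is our reading of the baseline top-10% enrichment score described above, written from the textual description rather than from the authors' code; the array inputs and the number of random samplings are assumptions.

```python
# Sketch of the top-10% enrichment score described above: count mutants in
# the model-ranked top 10% with measured fitness above wild type, divided by
# the average count from random 10% samplings of the same size.
import numpy as np

def top10_enrichment(y_pred, y_true, wt_fitness, n_random=1000, seed=0):
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    k = max(1, n // 10)
    # Hits among the k sequences ranked highest by the top model.
    top_idx = np.argsort(y_pred)[::-1][:k]
    model_hits = np.sum(y_true[top_idx] > wt_fitness)
    # Average hits among random samples of the same size.
    random_hits = np.mean([
        np.sum(y_true[rng.choice(n, size=k, replace=False)] > wt_fitness)
        for _ in range(n_random)
    ])
    return model_hits / random_hits if random_hits > 0 else np.inf
```

Note that both the 10% cutoff (k) and the wild-type fitness threshold enter explicitly, which is precisely what makes the score arbitrary in its cutoff, protein-specific in its inputs, and dependent on the fitness distribution of the dataset.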