By the end of this section, we show that a one-hot encoding used with a linear model (i.e. a linear kernel) is equivalent to assuming linear additivity of mutation effects, and a mathematical proof confirms this. We also summarize our conclusions on one-hot versus UniRep/eUniRep representations: eUniRep is indeed necessary, and one-hot encoding performs poorly by comparison.
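As a brief illustration of the additivity claim above, the following sketch writes out what a linear model on one-hot features computes. It is not the proof referred to in the text, and the notation is assumed here for illustration only: a sequence $s$ of length $L$, a wild-type sequence $s^{\mathrm{wt}}$, per-position, per-residue weights $w_{i,a}$, an intercept $b$, and the set $M$ of mutated positions.

```latex
% One-hot features select exactly one weight per position:
f(s) = b + \sum_{i=1}^{L} w_{i,\,s_i}

% The predicted effect of a multi-site mutant s' relative to wild type
% is therefore a sum of independent per-site terms:
f(s') - f(s^{\mathrm{wt}})
  = \sum_{i \in M} \left( w_{i,\,s'_i} - w_{i,\,s^{\mathrm{wt}}_i} \right)
```

Under these assumptions, each term in the sum is the predicted effect of the corresponding single mutant, so the model cannot represent interactions between positions.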
Top model selection and tuning
Our analysis so far has assumed an optimized ridge regression top model. To find a model that accurately predicts the effect of a new mutation on a given sequence, we tested the predictive accuracy of several linear regression models across a range of hyperparameters. Although we tested various regression models, including Lasso, Huber, RANSAC, and ensemble methods, increased model complexity yielded minimal performance improvement over simple ridge regression. Ridge regression has a single regularization hyperparameter, which can be optimized by cross-validation. In Biswas et al., a sparse refit is used, which has the effect of increasing regularization strength and thereby helps prevent overfitting to the starting mutant dataset, which is likely not representative of the actual fitness landscape. In our pipeline, cross-validation uses our custom scoring function (Supplementary 2). For each protein, we evaluated performance at training set sizes ranging from N = 8 to N = 1600. The dependence of performance on N is discussed above.
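The cross-validated selection of the ridge regularization strength described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the custom scoring function from Supplementary 2 is not reproduced here, so `rank_score` (a Spearman rank correlation) is a hypothetical stand-in. The function names, the alpha grid, the fold count, and the synthetic data are likewise assumptions.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weights w.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def rank_score(y_true, y_pred):
    """Hypothetical stand-in scorer: Spearman correlation (assumes no ties)."""
    r_true = np.argsort(np.argsort(y_true))
    r_pred = np.argsort(np.argsort(y_pred))
    return float(np.corrcoef(r_true, r_pred)[0, 1])

def select_alpha(X, y, alphas, n_folds=5, seed=0):
    """Return the alpha with the best mean cross-validated score."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mean_scores = {}
    for alpha in alphas:
        scores = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            w, b = ridge_fit(X[train], y[train], alpha)
            scores.append(rank_score(y[test], X[test] @ w + b))
        mean_scores[alpha] = float(np.mean(scores))
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

# Synthetic stand-in for (sequence representation, measured fitness) pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(96, 32))
y = X @ rng.normal(size=32) + 0.1 * rng.normal(size=96)
best_alpha, cv_scores = select_alpha(X, y, alphas=[1e-3, 1e-2, 1e-1, 1.0])
```

A rank-based scorer is used in this sketch because the downstream task is to order candidate mutants; any scorer with the same `(y_true, y_pred)` signature could be substituted.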
MCMC in silico directed evolution
Top-model training batch sizes and hyperparameters were selected separately for each protein before the simulated directed evolution runs. Both were chosen based on validation-set fitness predictions (Fig 2a), using the ranking-error function described in the Supplemental Discussion. The selected alpha values were 1e-2 for MS2 and PETase and 1e-3 for Beta-lactamase; the batch sizes were 240, 280, and 71 for Beta-lactamase, MS2, and PETase, respectively. Our evolution trajectories were run for 25 iterations of random mutagenesis coupled with selection steps based on the predicted fitness of the proposed mutations (Fig 2b). Although this component is now integrated into our pipeline and can produce a large number of candidate mutants for the evaluated proteins, experimental validation is still required to determine whether the proposed variants do in fact possess the desired trait enhancements.
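The loop of random mutagenesis and fitness-based selection described above can be sketched as a simple Metropolis-style sampler. This is an illustrative sketch under stated assumptions, not the pipeline's implementation: the acceptance rule, the `temperature` value, the single-substitution proposal, and the names `propose_mutation`, `evolve`, and `toy_fitness` are all assumptions. In practice, `predict_fitness` would be the trained top model applied to a sequence representation; here a toy function stands in for it.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_mutation(seq, rng):
    """Return a copy of seq with one random position changed to a different residue."""
    pos = rng.randrange(len(seq))
    new_res = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new_res + seq[pos + 1:]

def evolve(start_seq, predict_fitness, n_iter=25, temperature=0.1, seed=0):
    """Run one trajectory and return the (sequence, predicted fitness) history."""
    rng = random.Random(seed)
    current, current_fit = start_seq, predict_fitness(start_seq)
    trajectory = [(current, current_fit)]
    for _ in range(n_iter):
        candidate = propose_mutation(current, rng)
        cand_fit = predict_fitness(candidate)
        # Metropolis criterion (assumed): always accept improvements, and
        # accept worse proposals with probability exp(delta / temperature).
        accept = cand_fit >= current_fit or (
            rng.random() < math.exp((cand_fit - current_fit) / temperature)
        )
        if accept:
            current, current_fit = candidate, cand_fit
        trajectory.append((current, current_fit))
    return trajectory

# Hypothetical toy predictor: fraction of positions matching a made-up target.
TARGET = "MKTAYIAKQR"

def toy_fitness(seq):
    return sum(a == b for a, b in zip(seq, TARGET)) / len(TARGET)

traj = evolve("MKAAAAAAAA", toy_fitness)
best_seq, best_fit = max(traj, key=lambda item: item[1])
```

Occasionally accepting worse proposals lets a trajectory leave local optima of the predicted landscape, at the cost of a final sequence that is not guaranteed to be the best one visited, which is why the sketch keeps the full history.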