Results & Discussion
The success and necessity of learned feature representations vary across three different proteins
Our first objective was to confirm that our pipeline can reproduce the results presented for TEM-1 β-lactamase in Biswas et al. Using our fitness-ranking error function, we calculated the percent reduction in ranking error for four different mutant datasets, using several different embedded sequence representations, with performance compared across five different training-batch sizes (Fig 1b). For the TEM-1 β-lactamase dataset, percent error reduction was calculated for training-batch sizes of N = 24, 48, 72, 96, and 120; the best reduction in error, averaged across all batch sizes, was achieved using the evotuned representation (average error reduction = 26.05%), followed by the Global UniRep representation (20.51%). These results were compared against a one-hot encoding, which performed worse than the UniRep-based inputs, with a 6.50% average reduction in ranking error.
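For clarity, the sketch below illustrates one way the percent reduction in ranking error over a random-sampling baseline could be computed for a single training batch. The ridge-regression top model and the specific form of the ranking-error metric shown here are illustrative assumptions, not the exact implementation used in our pipeline.

\begin{verbatim}
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import Ridge

def ranking_error(y_true, y_pred):
    # Illustrative ranking-error metric: mean absolute difference
    # between true and predicted fitness ranks within the test batch.
    return np.mean(np.abs(rankdata(y_true) - rankdata(y_pred)))

def percent_error_reduction(X, y, train_idx, test_idx, rng, n_random=100):
    # Train a simple top model (ridge regression used here as a
    # stand-in) on the embedded sequences of one training batch.
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    model_err = ranking_error(y[test_idx], model.predict(X[test_idx]))

    # Random-sampling baseline: average ranking error of randomly
    # permuted fitness orderings over the same test batch.
    baseline_err = np.mean([
        ranking_error(y[test_idx], rng.permutation(y[test_idx]))
        for _ in range(n_random)
    ])
    return 100.0 * (baseline_err - model_err) / baseline_err
\end{verbatim}

In this sketch, X holds the embedded (or one-hot) sequence representations and y the measured fitness values; the percentages reported in the text are averaged over the training-batch sizes listed above.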
In evaluating performance on our single-mutant MS2 dataset, we see that Global UniRep and the evotuned representation produce only minor improvements in ranking error (6.62% and 9.19% error reduction, respectively), while the one-hot encodings performed only slightly worse than Global UniRep, with a 6.35% error reduction over random sampling. Additionally, we tested the performance of a locally-evotuned representation, which was trained for 25 epochs after being initialized with random parameters rather than the Global UniRep weights. The locally-evotuned representation gave the worst performance of all representations tested, with a 3.48% increase in ranking error over random sampling. We repeated this analysis using a second MS2 mutant dataset, containing fitness scores for all possible combinations of double amino acid substitutions within the region of residues 71 to 76. With these data, we found a significant improvement in our top model's ability to predict the relative fitness rankings of test-batch mutants, with error reductions of 33.09%, 39.80%, 4.39%, and 37.65% for Global UniRep, evotuned UniRep, locally-evotuned UniRep, and one-hot encodings, respectively, averaged over training-batch sizes of N = 24, 48, 72, 96, and 120. For both the single- and double-mutant datasets, the worst predictive performance was seen with the locally-evotuned representation; we suspect this is due to the low number of evotuning epochs used to produce its parameters, a training period too short to progress from a random initial state to the deep-learned sequence embeddings possessed by the globally trained UniRep.
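For reference, the one-hot baseline used throughout these comparisons can be constructed as in the minimal sketch below; the amino-acid alphabet ordering and flattening shown here are illustrative and may differ in detail from our pipeline.

\begin{verbatim}
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    # Binary (length x 20) matrix with a single 1 per position,
    # flattened into one feature vector for the top model.
    x = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()
\end{verbatim}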
The same analysis described above was performed on our 85-sample PETase dataset, but with smaller training and test batch sizes due to the limited data available. In addition to comparing performance using Global UniRep, PETase-eUniRep, and one-hot encodings, we tested predictive performance using weights that had been evotuned for TEM-1 β-lactamase, provided by Biswas et al. Averaged over training-batch sizes of N = 24, 36, 48, 60, and 72, our predictive models produced ranking-error reductions of 2.68%, 1.46%, 6.99%, and 1.20% for the Global UniRep, PETase-eUniRep, TEM1-eUniRep, and one-hot sequence representations, respectively. We chose to compare representations evotuned for PETase and TEM-1 in order to evaluate the efficacy of our own evotuning procedure relative to that performed by Biswas et al. While β-lactamases (TEM-1) and lipases (PETase) come from fairly different enzyme families, we expected their evotuned representations to share some similarities. Our experiments show the TEM-1 weights yielding better results than those evotuned specifically for PETase, suggesting that modifications to our own evotuning procedure may be necessary to access the full potential of evotuned representations.
The full set of error-reduction results for various combinations of batch sizes and sequence representations for each protein dataset can be seen in supplementary figure (XX).
These results raise the question of whether the full 1900-dimensional eUniRep representation is actually necessary for single-mutant prediction; on the single-mutant MS2 dataset, the one-hot encoding achieved an error reduction within a few percentage points of the UniRep-based representations.
However, it is important to remember that single-mutant results alone do not fully characterize the evolutionary fitness landscape. Multiple mutations expressed together must be considered in order to capture epistatic effects, which simpler top models may fail to represent.
Epistasis detection: predicting the fitness of multiple mutations from single-mutant data
An effective predictive model for protein engineering purposes needs to predict the epistatic effects of multiple co-expressed mutations with better accuracy than the additive combination of the corresponding single-mutant fitness scores. We define epistasis herein as the difference between the effect that multiple mutations have when expressed together and the additive sum of the effects of the individual mutations \cite{pokusaeva2018experimental, sarkisyan}. Due to the cooperative nature of interactions between neighboring amino acids in a protein, it is challenging to predict the effect that a single amino acid mutation has on the protein's overall ability to fold and retain a stable, functional state. The introduction of multiple mutations increases the complexity of this problem exponentially, and can effectively result in novel interactions between amino acids not observed in the wild type. To characterize the ability of the different embedded sequence representations to describe the effects of combined mutations, we trained our top model on input data from the MS2 single-mutant fitness landscape and compared its predictions of double-mutant fitness to the experimentally determined double-mutant fitness values (Fig 1c). When trained on the full single-mutant fitness landscape and tested on the double-mutant dataset, the best predictive accuracy was achieved using the one-hot encoding (MSE = 0.743), followed by eUniRep and Global UniRep (MSE = 0.987 and 1.07, respectively). As an additional point of comparison with the one-hot representation, we calculated additive predictions of double-mutant fitness from the single-mutant fitness data using Equation 2; these additive predictions had a mean squared error of 0.841, performing slightly worse than the one-hot representation.
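For concreteness, if Equation 2 takes the standard additive convention (stated here as an assumption; the exact form is given earlier in the manuscript), the additive prediction for a double mutant carrying substitutions $A$ and $B$, and the corresponding epistasis term $\epsilon_{AB}$, would be

\[
\hat{f}_{AB} = f_{WT} + (f_A - f_{WT}) + (f_B - f_{WT}),
\qquad
\epsilon_{AB} = f_{AB} - \hat{f}_{AB},
\]

where $f_A$ and $f_B$ are the measured single-mutant fitness values, $f_{WT}$ is the wild-type fitness, and $f_{AB}$ is the observed double-mutant fitness. Under this convention, the reported MSE of 0.841 corresponds to the mean squared difference between $f_{AB}$ and $\hat{f}_{AB}$ across the double-mutant dataset, and the epistasis definition given above is recovered as the observed fitness minus the additive prediction.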