Introduction

Background

Enzymes play key roles in biotechnology and human disease, which makes accurate modeling of enzyme stability an important goal for the protein modeling community. Accurate prediction of the effect of a point mutation on enzyme stability would enable rational protein engineering, in which predictions could be used directly to tailor an enzyme's functional envelope to a desired application, as has been explored previously \cite{22575958}. Furthermore, understanding how point mutations change enzyme stability would provide substantial insight into inherited diseases of metabolism [cite], cancer [cite], and the mechanisms by which bacteria become resistant to antibiotics, a growing public health threat [cite].

Enzyme stability data sets

Previous attempts to predict the stability changes conferred by point mutations in proteins have not considered enzymes specifically \cite{18632749}, and have been limited by decades-old, low-throughput molecular biology and biochemistry techniques. As a result, the largest data sets in which melting temperature $T_m$ is explicitly measured contain around 30 mutants [cite]. Other studies have collected large amounts of data, but the resulting measurements are imprecise because they convolve separate, orthogonal parameters [we will show that $k_{cat}$ and $T_m$ are not correlated] [cite: Romero and others]. Large data sets of thermal stabilities also exist, yet they were assembled without standardizing for varying experimental conditions \cite{14681373}, so their measurements are not comparable with one another. Existing data sets therefore contain either precise measurements for a small number of enzymes or imprecise measurements for a large number of variants. This is likely due to the high cost of enzymology and its reliance on outdated techniques.

Computational approaches to predicting the stability of point mutants

A number of computational approaches to predicting the stability of enzyme mutants have in turn relied on these far-from-ideal data sets as training data. To our knowledge, no study has both used a standardized approach to experimental characterization and produced enough mutants to provide a training set large enough to evaluate predictive ability. Previous attempts to predict the effect of point mutations on antibodies (~100 residues) have used primary sequence and amino acid properties as features \cite{21710487}.

Proposed approach to predicting the stability of point mutants of a family 1 glycoside hydrolase

We have previously shown that informative features for predicting kinetic constants can be selected from a set of structural features calculated by Rosetta from molecular models of BglB, a family 1 glycoside hydrolase \cite{26815142}. We hypothesized that a subset of Rosetta's feature set could also be used to predict soluble protein expression, as well as the change in thermal stability conferred by point mutations to the BglB structure.

Here, we present soluble protein expression and thermal stability data for 115 single point mutants of BglB. To construct this data set, we relied on students learning about molecular modeling to design mutations predicted to be compatible with a model of the enzyme-substrate complex. High-throughput robotic automation was used to perform the molecular cloning, and open-source lab protocols, released on GitHub, were used to experimentally characterize the 115 enzyme mutants. The experimental measurements were combined with ~30 structural features calculated from full-atom molecular models of the enzyme-substrate complex and used to train machine learning classifiers to predict $T_m$ and soluble expression. Model performance was evaluated ...