To enable personalized genetics and medicine, it is important yet highly challenging to accurately predict disease-causing mutations from the sequences alone at high throughput. To meet this challenge, we build upon recent progress in machine learning, network analysis, and protein language models, and develop a sequences-based variant site prediction workflow based on the protein residue contact networks: 1. We employ and integrate various methods of building protein residue networks using state-of-the-art coevolution analysis tools (e.g., RaptorX, DeepMetaPSICOV, and SPOT-Contact) powered by deep learning. 2. We use machine learning algorithms (e.g., Random Forest, Gradient Boosting, and Extreme Gradient Boosting) to optimally combine 13 network centrality scores (calculated by NetworkX) with 7 other network scores calculated from the contact probability matrices to jointly predict key residues as hot spots for disease mutations. 3. Using a dataset of 107 proteins rich in disease mutations, we rigorously evaluate the network scores individually and collectively in comparison with alternative structures-based network scores (using predicted structures by AlphaFold). By optimally combing three coevolution analysis methods and the resulting network scores by machine learning, we are able to discriminate deleterious and neutral mutation sites accurately (AUC of ROC ~ 0.84). Furthermore, by combining our method with a state-of-the-art predictor of the functional effects of sequence variations based on large protein language models, we have significantly improved the prediction of disease variant sites (AUC ~ 0.89). This work supports a promising strategy of combining an ensemble of network scores based on different coevolution analysis methods via machine learning to predict candidate sites of disease mutations, which will inform downstream applications of disease diagnosis and targeted drug design.

Wenjun Zheng

and 1 more