Wenjun Zheng - Authorea

Public Documents 3

Predicting hotspots for disease-causing single nucleotide variants using sequences-ba...

Wenjun Zheng

August 31, 2023

To enable personalized genetics and medicine, it is important yet highly challenging to accurately predict disease-causing mutations from sequences alone at high throughput. To meet this challenge, we build upon recent progress in machine learning, network analysis, and protein language models, and develop a sequences-based variant site prediction workflow based on protein residue contact networks: 1. We employ and integrate various methods of building protein residue networks using state-of-the-art coevolution analysis tools (e.g., RaptorX, DeepMetaPSICOV, and SPOT-Contact) powered by deep learning. 2. We use machine learning algorithms (e.g., Random Forest, Gradient Boosting, and Extreme Gradient Boosting) to optimally combine 13 network centrality scores (calculated by NetworkX) with 7 other network scores calculated from the contact probability matrices to jointly predict key residues as hot spots for disease mutations. 3. Using a dataset of 107 proteins rich in disease mutations, we rigorously evaluate the network scores individually and collectively in comparison with alternative structures-based network scores (using structures predicted by AlphaFold). By using machine learning to optimally combine three coevolution analysis methods and the resulting network scores, we are able to discriminate deleterious and neutral mutation sites accurately (AUC of ROC ~ 0.84). Furthermore, by combining our method with a state-of-the-art predictor of the functional effects of sequence variations based on large protein language models, we have significantly improved the prediction of disease variant sites (AUC ~ 0.89). This work supports a promising strategy of using machine learning to combine an ensemble of network scores from different coevolution analysis methods to predict candidate sites of disease mutations, which will inform downstream applications in disease diagnosis and targeted drug design.
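The core idea of the workflow above (turn a predicted contact probability matrix into a residue network, score each residue by a centrality measure, and assess the scores by the area under the ROC curve) can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses plain Python rather than NetworkX, computes only degree centrality rather than the 13 centrality scores, and omits the machine learning step. The function names, the probability cutoff of 0.5, and the minimum sequence separation of 3 are illustrative assumptions.

```python
from itertools import combinations


def contact_graph(prob, cutoff=0.5, min_sep=3):
    """Build an undirected residue graph from a contact probability matrix.

    An edge i-j is kept when the predicted contact probability reaches
    `cutoff` and the residues are at least `min_sep` apart in sequence.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    n = len(prob)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if j - i >= min_sep and prob[i][j] >= cutoff:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def degree_centrality(adj):
    """Fraction of the other residues that each residue contacts."""
    n = len(adj)
    return {i: len(nbrs) / (n - 1) for i, nbrs in adj.items()}


def roc_auc(scores, labels):
    """AUC as the probability that a positive site outscores a negative one.

    Ties count as 0.5. Labels are 1 (deleterious site) or 0 (neutral site).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Given per-residue labels, `roc_auc(list(degree_centrality(adj).values()), labels)` would score a single centrality measure; a combined predictor would replace the raw centrality with the output of a trained classifier.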

Predicting lipid and ligand binding sites in TRPV1 channel by molecular dynamics simu...

Wenjun Zheng

and 1 more

August 21, 2020

As a key cellular sensor, the TRPV1 channel undergoes a gating transition from a closed state to an open state in response to many physical and chemical stimuli. This transition is regulated by small-molecule ligands including lipids and various agonists/antagonists, but the underlying molecular mechanisms remain obscure. Thanks to the recent revolution in cryo-electron microscopy, a growing number of structures of TRPV1 and other TRPV channels have been solved in complex with various ligands including lipids. Toward elucidating how ligand binding correlates with TRPV1 gating, we have performed extensive molecular dynamics simulations (with a cumulative time of 20 μs), starting from high-resolution structures of TRPV1 in both the closed and open states. By comparing the open- and closed-state ensembles, we have identified state-dependent binding sites for small-molecule ligands in general and lipids in particular. We further use machine learning to predict top ligand-binding sites as important features for classifying the closed vs open states. The predicted binding sites are thoroughly validated by matching them to homologous sites in all structures of TRPV channels bound to lipids and other ligands, and to previous functional/mutational studies of ligand binding in TRPV1. Taken together, this study integrates rich structural, dynamic, and functional data to inform the future design of small-molecule drugs targeting TRPV1.
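One simple way to picture a state-dependent binding site is as a residue whose ligand-contact occupancy differs between the closed-state and open-state ensembles. The sketch below ranks residues by that difference. It is an illustrative stand-in, not the machine-learning classification used in the study; the function names, the input layout (one set of contacting residue numbers per snapshot), and the `min_diff` threshold are all assumptions.

```python
from collections import Counter


def occupancy(frames):
    """Fraction of frames in which each residue contacts a ligand.

    `frames` is a list of sets of residue numbers, one set per MD snapshot.
    """
    counts = Counter(res for frame in frames for res in frame)
    return {res: c / len(frames) for res, c in counts.items()}


def state_dependent_sites(closed_frames, open_frames, min_diff=0.5):
    """Rank residues by occupancy difference (open minus closed).

    Returns (residue, difference) pairs with |difference| >= min_diff,
    sorted by decreasing absolute difference, then by residue number.
    """
    occ_c = occupancy(closed_frames)
    occ_o = occupancy(open_frames)
    residues = set(occ_c) | set(occ_o)
    diffs = {r: occ_o.get(r, 0.0) - occ_c.get(r, 0.0) for r in residues}
    hits = [(r, d) for r, d in diffs.items() if abs(d) >= min_diff]
    return sorted(hits, key=lambda rd: (-abs(rd[1]), rd[0]))
```

A positive difference marks a residue contacted more often in the open state, a negative one a residue favored in the closed state. In a classifier-based analysis, per-residue contact features like these would be the inputs, and feature importance would take the place of the raw difference.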

Predicting cryptic ligand binding sites based on normal modes guided conformational s...

Wenjun Zheng

June 29, 2020

To greatly expand the druggable genome, fast and accurate predictions of cryptic small-molecule binding sites in target proteins are in high demand. In this study, we have developed a fast and simple conformational sampling scheme guided by normal modes computed from coarse-grained elastic models, followed by atomistic backbone refinement and sidechain repacking. Despite the complex and diverse conformational changes observed upon ligand binding, we found that simply sampling along each of the lowest 30 modes is near optimal for adequately restructuring cryptic sites so that they can be detected by existing pocket-finding programs such as fpocket and ConCavity. We further trained machine-learning protocols to optimize the combination of the sampling-enhanced pocket scores with other dynamic and conservation scores, which only slightly improved the performance. As assessed on a training set of 84 known cryptic sites and a test set of 14 proteins, our method achieved high prediction accuracy (with area under the receiver operating characteristic curve > 0.8), comparable to the CryptoSite server. Compared with CryptoSite and other methods based on extensive molecular dynamics simulation, our method is much faster (1-2 hours for an average-size protein) and simpler (using only pocket scores), so it is suitable for high-throughput processing of large datasets of protein structures at the genome scale.
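The sampling step described above, displacing a structure along each low-frequency normal mode, can be sketched as follows. The sketch assumes the mode vectors have already been computed (for example from an elastic network model) and does not cover the atomistic backbone refinement or sidechain repacking. The function names and the RMSD step sizes are illustrative assumptions; the abstract does not state the amplitudes used.

```python
import math


def displace_along_mode(coords, mode, rmsd):
    """Return coordinates moved along `mode` so the displacement has the given RMSD.

    coords, mode: lists of (x, y, z) tuples, one per atom (e.g. C-alpha).
    A negative `rmsd` moves the structure in the opposite direction.
    """
    norm = math.sqrt(sum(c * c for vec in mode for c in vec))
    if norm == 0.0:
        raise ValueError("mode vector has zero length")
    # Scale so that sqrt(mean squared atomic displacement) equals |rmsd|.
    scale = rmsd * math.sqrt(len(coords)) / norm
    return [
        tuple(x + scale * d for x, d in zip(atom, vec))
        for atom, vec in zip(coords, mode)
    ]


def sample_modes(coords, modes, rmsd_steps=(1.0, 2.0)):
    """Generate conformations along each mode in both directions.

    Returns a list of (mode_index, signed_rmsd, new_coords) tuples.
    The step sizes are placeholder values, not taken from the paper.
    """
    ensemble = []
    for k, mode in enumerate(modes):
        for step in rmsd_steps:
            for sign in (1.0, -1.0):
                moved = displace_along_mode(coords, mode, sign * step)
                ensemble.append((k, sign * step, moved))
    return ensemble
```

Passing the 30 lowest modes as `modes` would yield one small ensemble per mode, and each resulting conformation could then be handed to a pocket-finding program for scoring.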