Computing protein structure from amino acid sequence information has been a long-standing grand challenge. CASP (Critical Assessment of Structure Prediction) conducts community experiments aimed at advancing solutions to this and related problems. Experiments are conducted every two years. The 2020 experiment (CASP14) saw major progress, with the second generation of deep learning methods delivering accuracy comparable with experiment for many single proteins. There is an expectation that these methods will have much wider application in computational structural biology. Here we summarize results from the most recent experiment, CASP15, in 2022, with an emphasis on new deep learning-driven progress. Other papers in this special issue of Proteins provide more detailed analysis. For single protein structures, the AlphaFold2 deep learning method is still superior to other approaches, but there are two points of note. First, although AlphaFold2 was the core of all the most successful methods, there was a wide variety of implementation and combination with other methods. Second, using the standard AlphaFold2 protocol and default parameters only produces the highest quality result for about two thirds of the targets, and more extensive sampling is required for the others. The major advance in this CASP is the enormous increase in the accuracy of computed protein complexes, achieved by the use of deep learning methods, although overall these do not fully match the performance for single proteins. Here too, AlphaFold2-based methods perform best, and again more extensive sampling than the defaults is often required. Also of note are the encouraging early results on the use of deep learning to compute ensembles of macromolecular structures.
Critically for the usability of computed structures, for both single proteins and protein complexes, deep learning derived estimates of both local and global accuracy are of high quality; however, the estimates in interface regions are slightly less reliable. CASP15 also included computation of RNA structures for the first time. Here, the classical approaches produced better agreement with experiment than the new deep learning ones, and accuracy is limited. Also, for the first time, CASP included the computation of protein-ligand complexes, an area of special interest for drug design. Here too, classical methods were still superior to deep learning ones. Many new approaches were discussed at the CASP conference, and it is clear methods will continue to advance.
For the first time, the 2022 CASP (Critical Assessment of Structure Prediction) community experiment included a section on computing multiple conformations for protein and RNA structures. There was full or partial success in reproducing the ensembles for four of the nine targets, an encouraging result. For protein structures, enhanced sampling with variations of the AlphaFold2 deep learning method was by far the most effective approach. One substantial conformational change caused by a single mutation across a complex interface was accurately reproduced. In two other assembly modeling cases, methods succeeded in sampling conformations near to the experimental ones even though environmental factors were not included in the calculations. An experimentally derived flexibility ensemble allowed a single accurate RNA structure model to be identified. Difficulties included how to handle sparse or low-resolution experimental data and the current lack of effective methods for modeling RNA/protein complexes. However, these and other obstacles appear addressable.
Ncb5or (NADH cytochrome b5 oxidoreductase) is a cytosolic ferric reductase implicated in diabetes and neurological conditions. Ncb5or comprises cytochrome b5 (b5) and cytochrome b5 reductase (b5R) domains separated by a CHORD-Sgt1 (CS) linker domain. Ncb5or redox activity depends on proper interdomain interactions to mediate electron transfer from NADH or NADPH via FAD to heme. While full-length human Ncb5or has proven resistant to crystallization, we have succeeded in obtaining high-resolution atomic structures of the b5 domain and a construct containing the CS and b5R domains (CS/b5R). Ncb5or also contains an N-terminal intrinsically disordered region of 50 residues with a distinctive, conserved L34MDWIRL40 motif that has no homologs in animals but is present in root lateral formation protein (RLF) in rice and Increased Recombination Center 21 (IRC21) in baker’s yeast, and in these proteins, it is likewise attached to a b5 domain. After unsuccessful attempts at crystallizing a human Ncb5or construct comprising the N-terminal region naturally fused to the b5 domain, we were able to obtain a high-resolution atomic structure of a recombinant rice RLF construct corresponding to residues 25-129 of human Ncb5or (52% sequence identity; 74% similarity). The structure reveals Trp120 (corresponding to invariant Trp37 in Ncb5or) to be part of an 11-residue α-helix (S116QMDWLKLTRT126) packing against two of the four helices in the b5 domain that surround heme (α2 and α5). The Trp120 side chain forms a network of interactions with the side chains of four highly conserved residues corresponding to Tyr85 and Tyr88 (α2), Cys124 (α5), and Leu47 in Ncb5or. Circular dichroism (CD) measurements of human Ncb5or fragments further support a key role of Trp37 in nucleating the formation of the N-terminal helix, whose location in the N/b5 module suggests a role in regulating the function of this multidomain redox enzyme.
This study revealed for the first time an ancient origin of a helical motif in the N/b5 module as reflected by its existence in a class of cytochrome b5 proteins from three kingdoms among eukaryotes.
TYK2 is a non-receptor tyrosine kinase and a member of the Janus kinases (JAKs), with a central role in several diseases, including cancer. The JAKs’ catalytic domains (KD) are highly conserved, yet the isolated TYK2-KD exhibits unique specificities. In previous work, using molecular dynamics (MD) simulations of a catalytically impaired TYK2-KD variant (P1104A), we found that this amino-acid change in the JAK-characteristic insert (αFG) acts at the level of dynamics. Given that structural dynamics is key to allosteric activation of protein kinases, in this study we applied a long-scale MD simulation and investigated an active TYK2-KD form in the presence of adenosine 5’-triphosphate and one magnesium ion, a state that represents a dynamic and crucial step of the catalytic cycle in other protein kinases. Community analysis of the MD trajectory shed light, for the first time, on the dynamic profile and dynamics-driven allosteric communications within the TYK2-KD during activation. It revealed that αFG, and amino-acids P1104, P1105 and I1112 in particular, hold a pivotal role and act synergistically with a dynamically coupled communication network of amino-acids serving intra-KD signaling for allosteric regulation of TYK2 activity. Corroborating our findings, most of the identified amino-acids are associated with cancer-related missense/splice-site mutations of the Tyk2 gene. We propose that the conformational dynamics at this step of the catalytic cycle, coordinated by αFG, underlies TYK2-unique substrate recognition and accounts for its distinct specificity. In total, this work adds to knowledge towards an in-depth understanding of TYK2 activation and may be valuable towards a rational design of allosteric TYK2-specific inhibitors.
We present the results for CAPRI Round 54, the 5th joint CASP-CAPRI protein assembly prediction challenge. The Round offered 37 targets, including 14 homo-dimers, 3 homo-trimers, 13 hetero-dimers including 3 antibody-antigen complexes, and 7 large assemblies. On average ~70 CASP and CAPRI predictor groups, including more than 20 automatic servers, submitted models for each target. A total of 21941 models submitted by these groups and by 15 CAPRI scorer groups were evaluated using the CAPRI model quality measures and the DockQ score consolidating these measures. The prediction performance was quantified by a weighted score based on the number of models of acceptable quality or higher submitted by each group among their 5 best models. Results show substantial progress achieved across a significant fraction of the 60+ participating groups. High-quality models were produced for about 40% of the targets compared to 8% two years earlier, a remarkable improvement resulting from the wide use of the AlphaFold2 and AlphaFold-Multimer software. Creative use was made of the deep learning inference engines, affording the sampling of a much larger number of models and enriching the multiple sequence alignments with sequences from various sources. Wide use was also made of the AlphaFold confidence metrics to rank models, permitting top performing groups to exceed the results of the public AlphaFold-Multimer version used as a yardstick. This notwithstanding, performance remained poor for complexes with antibodies and nanobodies, where evolutionary relationships between the binding partners are lacking, and for complexes featuring conformational flexibility, clearly indicating that the prediction of protein complexes remains a challenging problem.
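The weighted per-group score described above can be illustrated with a minimal sketch. The DockQ cutoffs used here (0.23, 0.49, 0.80) are the ones commonly quoted for mapping DockQ onto the CAPRI acceptable, medium and high classes. The weights (1, 2, 3), the rule that each target is represented by the best of its first five models, and all function names are assumptions made for illustration. The abstract does not state them.

```python
from collections import Counter

# Commonly quoted DockQ cutoffs for the CAPRI quality classes.
DOCKQ_CLASSES = [(0.80, "high"), (0.49, "medium"), (0.23, "acceptable")]

# Hypothetical weights; the abstract does not give the actual values.
WEIGHTS = {"acceptable": 1, "medium": 2, "high": 3}


def capri_class(dockq):
    """Map a DockQ value in [0, 1] onto a CAPRI-style quality class."""
    for cutoff, label in DOCKQ_CLASSES:
        if dockq >= cutoff:
            return label
    return "incorrect"


def group_score(models_by_target, top_n=5):
    """Weighted score for one group.

    models_by_target maps a target id to the DockQ values of the group's
    models in submission rank order. Assumption: each target counts once,
    using the best class among its first top_n models.
    """
    counts = Counter()
    for dockq_values in models_by_target.values():
        top = dockq_values[:top_n]
        if top:
            counts[capri_class(max(top))] += 1
    return sum(WEIGHTS.get(label, 0) * n for label, n in counts.items())


if __name__ == "__main__":
    submissions = {
        "T1": [0.85, 0.10],                      # high
        "T2": [0.30, 0.50, 0.20],                # medium
        "T3": [0.05],                            # incorrect
        "T4": [0.1, 0.1, 0.1, 0.1, 0.1, 0.9],    # sixth model is ignored
    }
    print(group_score(submissions))
```

Counting each target once keeps a group from inflating its score by submitting several near-identical good models for an easy target. A per-model count would be an equally plausible reading of the abstract.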
The rapid evolution of protein structure prediction tools has significantly broadened access to protein structural data. Although predicted structure models have the potential to accelerate and impact fundamental and translational research significantly, it is essential to note that they are not validated and cannot be considered the ground truth. Thus, challenges persist, particularly in capturing protein dynamics, predicting multi-chain structures, interpreting protein function, and assessing model quality. Interdisciplinary collaborations are crucial to overcoming these obstacles. Databases like the AlphaFold Protein Structure Database, the ESM Metagenomic Atlas, and initiatives like the 3D-Beacons Network provide FAIR access to these data, enabling their interpretation and application across a broader scientific community. Whilst substantial advancements have been made in protein structure prediction, further progress is required to address the remaining challenges. Developing training materials, nurturing collaborations, and ensuring open data sharing will be paramount in this pursuit. The continued evolution of these tools and methodologies will deepen our understanding of protein function and accelerate discoveries in disease pathogenesis and drug development.
CASP assessments primarily rely on comparing predicted coordinates with experimental reference structures. However, errors in the reference structures can potentially reduce the accuracy of the assessment. This issue is particularly prominent in cryoEM-determined structures, and therefore, in the assessment of CASP15 cryoEM targets, we directly utilized density maps to evaluate the predictions. A method for ranking the quality of protein chain predictions based on rigid fitting to experimental density was found to correlate well with the CASP assessment scores. Overall, the evaluation against the density map indicated that the models are of high accuracy although local assessment of predicted side chains in a 1.52 Å resolution map showed that side-chains are sometimes poorly positioned. The top 136 predictions associated with 9 protein target reference structures were selected for refinement, in addition to the top 40 predictions for 11 RNA targets. To this end, we have developed an automated hierarchical refinement pipeline in cryoEM maps. For both proteins and RNA, the refinement of CASP15 predictions resulted in structures that are close to the reference target structure, including some regions with better fit to the density. This refinement was successful despite large conformational changes and secondary structure element movements often being required, suggesting that predictions from CASP-assessed methods could serve as a good starting point for building atomic models in cryoEM maps for both proteins and RNA. Loop modeling continued to pose a challenge for predictors with even short loops failing to be accurately modeled or refined at times. The lack of consensus amongst models suggests that modeling holds the potential for identifying more flexible regions within the structure.
The canonical function of glutamyl-tRNA synthetase (GluRS) is to glutamylate tRNA Glu. Yet, not all bacterial GluRSs glutamylate tRNA Glu; many glutamylate both tRNA Glu and tRNA Gln, while some glutamylate only tRNA Gln and not the cognate substrate tRNA Glu. Understanding the basis of this unique tRNA Glx-specificity is important. Mutational studies have hinted at hotspot residues, both on tRNA Glx and GluRS, that play crucial roles in tRNA Glx-specificity. But the underlying structural basis remains unexplored. The majority of biochemical studies related to tRNA Glx-specificity have been performed on GluRS from Escherichia coli and other proteobacterial species. However, since the early crystal structures of GluRS and tRNA Glu-bound GluRS were from non-proteobacterial species (Thermus thermophilus), the proteobacterial biochemical data have often been interpreted in the context of non-proteobacterial GluRS structures. Marked differences between proteo- and non-proteobacterial GluRSs have been demonstrated and therefore it is important that tRNA Glx-specificity be understood vis-a-vis proteobacterial GluRS structures. Towards this goal we have solved the crystal structure of GluRS from E. coli. Using the solved structure and several other currently available proteo- and non-proteobacterial GluRS crystal structures, we have probed the structural basis of tRNA Glx-specificity of bacterial GluRSs. Specifically, our analysis suggests a unique role played by a tRNA Glx D-helix contacting loop of GluRS in modulation of tRNA Gln-specificity. While earlier studies had identified functional hotspots on tRNA Glx that controlled tRNA Glx-specificity of GluRS, this is the first report of complementary signatures of tRNA Glx-specificity in GluRS.
Protein domains are structural, functional, and evolutionary units. These domains bring out the diversity of functionality by means of interactions with other co-existing domains and provide stability. Hence, it is important to study intra-protein inter-domain interactions from the perspective of types of interactions. Domains within a chain could interact over short timeframes or permanently, rather like protein-protein interactions (PPIs). However, no systematic study has been carried out comparing the two classes, namely permanent and transient domain-domain interactions (DDIs). In this work, we studied 264 two-domain proteins belonging to either of these classes, and their interfaces, on the basis of several factors, such as interface area and details of interactions (number, strengths, and types of interactions). We also characterized them based on residue conservation at the interface, correlation of residue motions across domains, their involvement in repeat formation, and their involvement in particular molecular processes. Finally, we could analyse the interactions arising from domains in two-domain monomeric proteins, and we observed significant differences between these two classes of domain interactions and a few similarities. This study will help to obtain a better understanding of structure-function and folding principles of multi-domain proteins.
Proteins such as enzymes perform their function through predominantly non-covalent interactions between transiently interacting units. Despite their transient nature, such interactions affect the overall structural topology of the protein, enabling proteins to deactivate or activate. This alteration of the structural topology is studied by employing protein structural networks, which are node-edge representative models of protein structure, reported as a robust tool for capturing interactions between residues. Several methods have been optimised to collect meaningful, functionally relevant information by studying alteration of structural networks. In this article, different methods of comparing protein structural networks are employed, along with spectral decomposition of graphs, to study the subtle impact of protein-protein interactions. A detailed analysis of the structural network of interacting partners is performed across a dataset of around 900 pairs of bound complexes and corresponding unbound protein structures. The variation in network parameters at, around and far away from the interface is analysed. Finally, we present interesting case studies, where an allosteric mechanism of structural impact is understood from communication-path detection methods. The results of this analysis are beneficial in understanding protein stability, for future engineering and docking studies.
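The idea of comparing the structural network of a protein in its unbound and bound forms can be sketched as follows. This is only an illustration under stated assumptions: nodes are residues represented by Cα coordinates, an edge joins two residues whose Cα atoms lie within a 7 Å cutoff, and both forms have the same residues in the same order. The cutoff, the sequence-separation filter and the function names are hypothetical choices, not taken from the article, and the spectral decomposition step is not covered.

```python
import math


def contact_edges(ca_coords, cutoff=7.0, min_seq_sep=2):
    """Return residue-index pairs (i, j), i < j, whose C-alpha atoms
    are within `cutoff` angstroms. Near-sequence neighbours closer than
    `min_seq_sep` positions are skipped."""
    edges = set()
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_seq_sep, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                edges.add((i, j))
    return edges


def degrees(edges, n):
    """Number of contacts per residue."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg


def compare_networks(unbound, bound, cutoff=7.0):
    """Edges gained or lost on binding, plus the per-residue degree change.
    Assumes both coordinate lists describe the same residues in order."""
    if len(unbound) != len(bound):
        raise ValueError("unbound and bound forms must have equal length")
    e_u = contact_edges(unbound, cutoff)
    e_b = contact_edges(bound, cutoff)
    d_u = degrees(e_u, len(unbound))
    d_b = degrees(e_b, len(bound))
    return {
        "gained": e_b - e_u,
        "lost": e_u - e_b,
        "degree_change": [b - u for u, b in zip(d_u, d_b)],
    }


if __name__ == "__main__":
    unbound = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (20.0, 0, 0)]
    bound = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (10.0, 0, 0)]
    print(compare_networks(unbound, bound))
```

Residues whose degree changes although they lie far from the interface would be natural starting points for the kind of communication-path analysis the abstract mentions.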
Trimethylamine monooxygenase (Tmm, EC 1.14.13.148) belongs to the family of flavin-containing monooxygenases (FMOs) that oxidize trimethylamine into trimethylamine-N-oxide (TMAO). Conventional methods for assaying Tmm are accurate over a narrow range of substrate/product concentrations. Here we report a TMAO-specific enzymatic assay for Tmm using polyallylamine hydrochloride (PAHCl)-capped MnO2 nanoparticles (PAHCl@MnO2). We achieved TMAO specificity using iodoacetonitrile to remove interfering trimethylamine. The change in the concentration of TMAO is measured by observing the difference in the absorbance of 3,3´,5,5´-tetramethylbenzidine (TMB) at 652 nm. The assay is tolerant to several interfering metal ions and other compounds. This method is more reliable and easier than currently known methods. The limit of detection (LOD) and limit of quantitation (LOQ) are 1 µM and 10 µM, respectively, for direct TMAO measurement.
The results of tertiary structure assessment at CASP15 are reported. For the first time, recognising the outstanding performance of AlphaFold 2 (AF2) at CASP14, all single chain predictions were assessed together, irrespective of whether a template was available. At CASP15 there was no single stand-out group, with most of the best-scoring groups - led by PEZYFoldings, UM-TBM and Yang Server - employing AF2 in one way or another. Many top groups paid special attention to generating deep Multiple Sequence Alignments (MSAs) and testing variant MSAs, thereby allowing them to successfully address some of the hardest targets. Such difficult targets, as well as lacking templates, were typically proteins with few homologues: small size, high α-helical content and monomeric structure were other likely aggravating factors. Local divergence between prediction and target correlated with localisation at crystal lattice or chain interfaces, and with regions exhibiting high B-factors in crystal structure targets, but should not necessarily be considered as representing error in the prediction. However, analysis of exposed and buried side chain accuracy showed room for improvement even in the latter. Nevertheless, a majority of groups, including those applying methods similar to those used to generate major resources such as the AlphaFold Protein Structure Database and the ESM Metagenomic Atlas, produced high quality predictions for most targets which are valuable for experimental structure determination, functional analysis and many other tasks across biology.
The core metabolic reactions of life drive electrons through a class of redox protein enzymes, the oxidoreductases. The energetics of electron flow is determined by the redox potentials of organic and inorganic cofactors as tuned by the protein environment. Understanding how protein structure affects oxidation-reduction energetics is crucial for studying metabolism, creating bioelectronic systems, and tracing the history of biological energy utilization on Earth. We constructed ProtReDox (https://protein-redox-potential.web.app), a manually curated database of experimentally determined redox potentials. With over 500 measurements, we can begin to identify how proteins modulate oxidation-reduction energetics across the tree of life. By mapping redox potentials onto networks of oxidoreductase fold evolution, we can infer the evolution of electron transfer energetics over deep time. ProtReDox is designed to include user-contributed submissions with the intention of making it a valuable resource for researchers in this field.
The prediction of protein-ligand complexes (PLC), using both experimental and predicted structures, is an active and important area of research, underscored by the inclusion of the Protein-Ligand Interaction category in the latest round of the Critical Assessment of Protein Structure Prediction experiment, CASP15. The prediction task in CASP15 consisted of predicting both the 3-dimensional structure of the receptor protein and the position and conformation of the ligand. This paper addresses the challenges and proposed solutions for devising automated benchmarking techniques for PLC prediction. The reliability of experimentally solved PLC as ground truth reference structures is assessed using various validation criteria. Similarity of PLC to previously released complexes is employed to judge the novelty and difficulty of a PLC as a prediction target. We show that the commonly used PDBBind time-split test-set is inappropriate for comprehensive PLC evaluation. Finally, we introduce a fully automated pipeline that predicts PLC and evaluates the accuracy of the protein structure, ligand pose, and protein-ligand interactions.
Prediction categories in the Critical Assessment of Structure Prediction (CASP) experiments change with the need to address specific problems in structure modeling. In CASP15, four new prediction categories were introduced: RNA structure, ligand-protein complexes, accuracy of oligomeric structures and their interfaces, and ensembles of alternative conformations. This paper lists technical specifications for these categories and describes their integration in the CASP data management system.
This article reports and analyzes the results of protein complex model accuracy estimation by our methods (DeepUMQA3 and GraphGPSM) in the 15th Critical Assessment of techniques for protein Structure Prediction (CASP15). The new deep learning-based multimeric complex model accuracy estimation methods are proposed based on an ensemble of three-level features coupled with deep residual/graph neural networks. For the input multimeric complex model, we describe it from three levels: overall complex features, intra-monomer features, and inter-monomer features. We designed an overall ultrafast shape recognition (USR) to characterize the relationship between local residues and the overall complex topology, and an inter-monomer USR to characterize the relationship between the residues of one monomer and the topology of other monomers. On the 39 complex targets of CASP15, DeepUMQA3 (Group name: GuijunLab-RocketX) ranked first in the assessment of interface residue accuracy. The Pearson correlation coefficient (PCC) between the interface-residue lDDT predicted by DeepUMQA3 and the real lDDT is 0.570, and DeepUMQA3 achieved the highest PCC on 29 out of 39 targets. GraphGPSM (Group name: GuijunLab-PAthreader) had a TM-score PCC>0.9 on 14 targets, showing a good ability to estimate the overall fold accuracy.
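For readers unfamiliar with the metric, the Pearson correlation coefficient reported above can be computed as in the following minimal sketch. The function name and the example per-residue values are hypothetical and are not CASP15 data.

```python
import math


def pearson(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences,
    for example predicted and real per-residue lDDT values."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / len(predicted)
    mean_o = sum(observed) / len(observed)
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)


if __name__ == "__main__":
    # Made-up interface-residue scores for illustration only.
    predicted_lddt = [0.82, 0.65, 0.40, 0.91, 0.55]
    real_lddt = [0.78, 0.70, 0.35, 0.88, 0.60]
    print(round(pearson(predicted_lddt, real_lddt), 3))
```

Because PCC is invariant to linear rescaling, a method can reach a high PCC while still being systematically over- or under-confident, so it measures ranking and trend rather than calibration.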
The human predictor team PEZYFoldings got third place with GDT-TS (first place with the Assessor’s formulae) in the single-domain category and tenth place in the multimer category in CASP15. In this paper, I describe the exact method used by PEZYFoldings in competitions. As AlphaFold2 and AlphaFold-Multimer, developed by DeepMind, are state-of-the-art structure prediction tools, it was assumed that enhancing the input and output of the tools was an effective strategy to obtain the highest accuracy for structure prediction. Therefore, I used additional tools and databases to collect evolutionarily related sequences and introduced a deep-learning-based model in the refinement step. In addition to these modifications, manual interventions were performed to address various tasks. Detailed analyses were performed after the competition to identify the main contributors to performance. Comparing the number of evolutionarily related sequences I used with those of the other teams that provided AlphaFold2’s baseline predictions revealed that an extensive sequence similarity search was one of the main contributors. The impact of the refinement model was minimal (p < 0.05 for the TM score). In addition, I noticed that I had gained large Z-scores with the subunits of H1137, for which I performed manual domain parsing considering the interfaces between the subunits. This finding implies that the manual intervention contributed to my performance. The prediction performance was low when I could not identify the evolutionarily related sequences. T1130 is an example; however, other teams were able to model better structures for it. Based on the discussions from the CASP15 conference, the two teams that ranked higher than PEZYFoldings had some hits for T1130. This may be because T1130 is a eukaryotic protein, whereas the additional databases used were mainly from metagenomic sequences, which primarily consist of prokaryotic proteins.
These results highlight the opportunities for improvement in 1) multimer prediction, 2) building larger and more diverse databases, and 3) developing tools to predict structures from primary sequences alone. In addition, transferring the manual intervention process to automation is a future concern.
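The GDT-TS ranking measure mentioned above can be sketched in simplified form. GDT-TS is the mean, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of Cα atoms in the model that lie within the cutoff of the corresponding atom in the reference. The sketch below assumes the model has already been superposed on the reference and uses that single superposition. The full measure searches for the superposition that maximizes each fraction, so this version gives a lower bound. The function name is hypothetical.

```python
import math


def gdt_ts(model_ca, ref_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS (0 to 100) for two pre-superposed lists of
    C-alpha coordinates covering the same residues in the same order."""
    if len(model_ca) != len(ref_ca) or not ref_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    dists = [math.dist(m, r) for m, r in zip(model_ca, ref_ca)]
    # Fraction of residues within each cutoff, then the mean over cutoffs.
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0)] * 4
    model = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
    print(gdt_ts(model, reference))
```

Because the 8 Å cutoff tolerates large local errors, a model can score well on GDT-TS while still having poorly placed loops, which is one reason assessors combine it with other measures.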
We introduce a deep learning-based ligand pose scoring model called zPoseScore for predicting protein-ligand complexes in the 15th Critical Assessment of Protein Structure Prediction (CASP15). Our contributions are three-fold: firstly, we generate six training and evaluation datasets by employing advanced data augmentation and sampling methods. Secondly, we redesign the “zFormer” module, inspired by AlphaFold2’s Evoformer, to efficiently describe protein-ligand interactions. This module enables the extraction of protein-ligand paired features that lead to accurate predictions. Lastly, we develop the zPoseScore framework with zFormer for scoring and ranking ligand poses, allowing for atomic-level protein-ligand feature encoding and fusion to output refined ligand poses and ligand per-atom deviations. Our results demonstrate excellent performance on various testing datasets, achieving Pearson’s correlation R = 0.783 and 0.659 for ranking docking decoys generated based on experimental and predicted protein structures of CASF-2016 protein-ligand complexes. Additionally, we obtain an averaged lDDT = 0.558 of AIchemy_LIG2 in CASP15 for de novo protein-ligand complex structure predictions. Detailed analysis shows that accurate ligand binding site prediction and side-chain orientation are crucial for achieving better prediction performance. Our proposed model is one of the most accurate protein-ligand pose prediction models and could serve as a valuable tool in small molecule drug discovery.