Shoshana J. Wodak, VIB-VUB Center for Structural Biology, Plailaan
2, 1050 Brussels Belgium. Email:
Shoshana.wodak@gmail.com,
ORDIC: http://orcid.org/0000-0002-0701-6545
Structural biology has been undergoing an unprecedented transformation
recently thanks to major breakthroughs in experimental methods such as
cryogenic electron microscopy (cryo-EM) and ground-breaking
computational approaches for predicting the 3D structure of proteins
based on cutting edge deep-learning methods.
Owing to spectacular advances in detector technology and software
algorithms, cryo-EM has revolutionized biology by enabling the
determination of complex biomolecular structures at near-atomic
resolution[1]. Over less than a decade, the number
of near-atomics-resolution structures solved using cryo-EM has grown
exponentially [2]. Foregoing the need for crystal formation, it has
enabled to elucidate the structures of important receptors and membrane
proteins, historically refractory to crystallographic studies [3].
Furthermore, increasingly sophisticated computational and experimental
cryo-EM methods are making it possible to unveil different
conformational and/or compositional states of the systems under study
[4-6], thereby providing valuable information on the dynamic
properties of these systems underpinning their biological function.
In parallel, the progressive introduction of new generation methods in
deep learning - a subfield of machine learning- to a maturing protein
modelling field has recently culminated with the phenomenal success of
AlphaFold2 (AF2), the deep-leaning engine developed by Google DeepMind,
in predicting the 3D structure of single chain proteins to an accuracy
rivaling with that of experimentally determined structures [7, 8].
This achievement has been a game changer with immense repercussions
across the fields of computational and experimental structural biology
[9, 10]. The software of these algorithms was made freely available
to the public [11] [https://github. com/deepmind/alphafold]
setting the stage for rapid further developments [12]. Additionally,
DeepMind has partnered with the European Bioinformatics Institute (EBI)
to create AlphaFold-DB [13], offering open access to over 200
million protein structures predicted by AlaFold, providing broad
coverage of UniProt [14].
The vast increase in high accuracy coverage of protein structure space
is already having a major impact in many areas of scientific research,
including elucidating aspects of evolutionary relationships and protein
function [15], identifying potential drug targets[16] and
greatly aiding experimental structure determination[17]. However,
AF2 as designed, and hence also AlphaFold-DB, provide no information on
the dynamic properties of proteins nor on the alternative conformations
that proteins sample to carry out their function [18]. Information
is also lacking on functionally important bound small molecule ligands,
and on the oligomeric structure of native proteins, where two or more
proteins (subunits) form higher order complexes[19]. Of these
essential areas the prediction of protein complexes, has received
special attention in the last two years. Viewed as the next frontier for
deep learning–based structure prediction methods, the community devised
ways of extending the power of AF2 to the prediction of protein
complexes. Creative uses of AF2 and AlphaFold2-Multimer, the inference
engine of AlphaFold directly trained on protein complexes from the
PDB[20], which include aggressive sampling of candidate solutions
combined with effective scoring and ranking models, helped yielding
high-quality models for 40% of the assembly targets in the CASP-CAPRI
(Critical Assessment of Structure Prediction -Critical Assessment
of PRedicted Interactions [21] ) blind prediction challenge of 2022
compared to the mere 8% produced in previous challenges [Lensink et
al. (under review)]. These are very encouraging results, suggesting
nevertheless that significant room remains for improvement [21].
Free access to the code of AF2 and similar deep-learning based software
like RoseTTAfold [22], offered by various community-based resources
such as ColabFold [12] played a key role in these advances. Access
to these resources is also having a resounding impact on the
experimental determination of protein structures. In several instances,
hard-to- solve X-ray and cryo-EM structures have been elucidated by
using AlphaFold predicted structures in molecular replacement protocols
[23, 24]. AlphaFold and RoseTTAFold models have been used
successfully to fit residual electron density in cryo-EM maps, most
notably in a recent assembly of the human nuclear pore complex [25].
This special issue of Proteomics features seven contributions showcasing
how the new wave of deep-learning tools and generated data are being
leveraged and integrated into cutting edge research in the life sciences
and how the frontier between experimental and computational approaches
is increasingly blurred. Contributions to this issue also underscore the
importance of free access to the data generated by both experimental and
computational approaches. These data are inherently complex and noisy,
hence the crucial role of tools for extracting useful information from
these data, a key step in generating new knowledge.
Varadi and Velankar, the team at the PDBe (Protein Databank
Europe), developing and managing the AlpfaFold-DB, in close
collaboration with Google DeepMind, describe the specifics of the
database, the key meta-information it includes and the impact it is
having across the fields of life-sciences research and development. They
discuss the challenges of organizing analyzing and providing meaningful
user access to 214 million unique protein structures, compared to around
200,000 PDB structures corresponding to 60,000 unique protein sequences.
Our attention is attracted to the specifics of the new body of data,
including the confidences scores associated with the predicted models,
the new insights they provide and some important limitations. Also
highlighted is the important role public data providers play in
integrating the new structural information with other key biological
data and disseminating it across other key resources such as UniProt and
more specialized databases such as and InterPro [26] and Pfam
[27]among others.
Tüting et al. , describe how AlphaFold predicted
structures enables the interpretation of cryo-EM maps from native cell
extracts. Combining data on crosslinking mass spectrometry[28] with
other proteomics techniques and systematic fitting of predicted
structures of single chain proteins from AlphaFold-DB into
medium-resolution cryo-EM maps of yeast native cell extracts, enabled
the team to derived models of the large multi-component heterogenous and
plastic protein assembly of the 2.6 MDa complex of yeast fatty acid
synthase, the closest one can come today to characterising such
assemblies in-situ using cryo-EM.
The study of Pei et al . al, is another edifying
example of how AlphaFold predicted structures are being used to generate
new knowledge on cellular processes, in this case providing insights
into the critical regulatory roles played by PARylation (the
posttranslational modification of proteins by linear or branched chains
of ADP-ribose units). To this end the study gathered data on sites
modified by PARylation on acidic residues (Asp (Asp (D)/Glu (E)) in more
than 300 human proteins. Following the example of an earlier
study[29], the joint multiple sequence alignments generated for
these proteins were fed to the AlphaFold2 inference engine to predict a
set of 260 confident interaction interfaces. Mapping the PARylation
sites of interest into these interfaces revealed these sites to occur
preferentially in coil and disordered regions and that interaction
interfaces featuring these sites involve short linear sequence
motifs[30] in both disordered and globular domains. More
specifically, D/E-PARylation sites were found in the interfaces of key
components of the RNA transcription and export complex, suggesting that
systematic PARylation-based regulation intervenes in multiple
RNA-related processes.
Deep Learning methods are also making headway in other areas of
structural and systems biology. Cohen and Schneidman-Duhovnyreport a new deep learning model for improving the information on the
spatial proximity of residues in multi-subunit complexes derived from
crosslinking mass spectrometry (XLMS), which the cryo-EM study of Tütinget al . in this issue critically relied on to model the large
yeast fatty acid synthase complex from cryo-EM data. Chemical
crosslinking followed by mass spectrometry [28] is increasingly used
to derive distance constraints or restraints in integrative modeling
techniques used to build models of large multi-component protein
assemblies. One of the challenges in interpreting crosslinking data is
designing a scoring function capable of quantifying how well a candidate
model fits the data. Most available approaches set an upper limit on the
distance between a cross-linked residue pair and compute the fraction of
satisfied crosslinks, neglecting the crucial influence of the spatial
neighbourhood on the distance spanned by the crosslinker. This
shortcoming is addressed by the deep learning model XlinkNet, trained to
predict the optimal distance range -instead of only an upper limit- for
a crosslinked residue pair based on their spatial environment of the
predicted structure. The model trained and validated using many
thousands protein structures from the wwPDB and AlphaFoldDB, and XLMS
data on tens of thousands of crosslinks, was shown to accurately
classify the distances ranges of most of the tested crosslinks and
provide valuable insights into the associated structural determinants.
The authors also stress the pressing need for better curation and
seamless links to publicly available structural information forin-vitro crosslinking data (mainly deposited in the PRIDE
database [31]).
Accounting for the dynamic properties of proteins or modeling the
alternative conformations that proteins sample to carry out their
function, is a long-standing challenge that main-stream protein modeling
techniques have been struggling with and deep-learning methods still do
not master. Christoffer and Kihara propose an approach for
modeling conformational changes often associated with the formation of
protein complexes, which they apply to protein-nucleic acid complexes.
These are very challenging complexes to model because their formation is
associated with a large flexibility of the components (see for example
ref [32]). The proposed approach focuses on modeling this type of
motion for the protein components alone, starting from the unbound
version of the corresponding structures and considers systems where this
motion involves the reorientation and displacement of relatively rigid
domains linked by flexible segments. A customized protein docking
algorithm designed to handle this type of motion [33] is used to
predict the most likely collective binding modes of all individual
domains to the nucleic acid component. Next, an anisotropic network
model (ANM) [34] is employed to deform the full protein structures
to match the docked domains, and further refine the resulting models to
optimize interactions with the nucleic acid component(s). Benchmarking
this approach on a limited set of protein-nucleic acid complexes where
such large-scale collective motions take place, and illustrating
representative examples, suggest that it represents a promising strategy
for tackling this difficult modeling problem.
Reliably scoring and ranking candidate models of protein complexes and
assigning the oligomeric state of proteins are other important
challenges unmet by current modeling algorithms, including deep
learning-based methods such as AlphaFold. The latter rely primarily on
various confidence scores to rank models whose relation to the physical
properties of the protein remains uncertain [35]. Schwekeet al. report a community-wide efforts to tackle these problems.
This effort exploits QS-Align [36] and ProtCID [37], two
noteworthy specialized resources that characterize protein complexes and
their interfaces. Using these resources the study produces a carefully
crafted benchmark dataset of ~1700 homodimer protein
crystal structures, which includes both physiological and
non-physiological complexes. This dataset is used to evaluate the
performance of protein interface scoring functions in discriminating
between both types of complexes. The unique features of the dataset
stems from its size, accuracy, and the fact that it contained
particularly challenging complexes to segregate correctly. Evaluating
252 scoring functions developed by 13 expert groups, this study
demonstrates the complementarity of these scoring function and shows
that the combined power of these functions outperforms individual
scores, paving the way for further optimizing such functions. This has
important implications for the development of improved methods for the
prediction of protein-protein interactions. The benchmark dataset and
its analysis should serve as a valuable resource for such future work.
The last 2 decades have seen an explosive growth of protein-protein
interaction (PPI) data derived from both small-scale and proteome-scale
interrogations in organisms from bacteria to human [38] as well from
various computational methods [39] including AlphaFold [40].
Data from these studies have been used to construct PPI networks, and
various properties of these networks have been scrutinized to gain
biological insights. With the PPI data being inherently noisy,
extracting meaningful information from these networks requires cross
referencing and integrating the PPI data with many other types of data,
such as protein and gene sequences, gene and protein expression levels,
as well as structural data [41]. The availability of tools and
resources that facilitate such integration and ensuing analyses is
therefore crucial, and particularly relevant to the main topic of this
Journal. LEVELNET the resource presented by Behbahani et
al. is such facilitator. Focusing on proteins whose 3D structures are
available in the PDB, LEVELNET integrates and explores PPI networks from
multiple sources of evidence. It builds a grid of networks for each
source representing different views of the associated interactions. It
allows to cluster interactions made by groups of related proteins based
on sequence identity and to infer interactions through homology
transfer. Examples of potential applications include the investigation
of the structural evidence supporting PPIs associated with specific
biological processes, comparing the PPI networks obtained through
computational inference versus homology transfer, and creating PPI
benchmark datasets with desired properties.
This transformational era is propelling structural biology to the
mainstream of research in the life sciences and beyond. This momentum
will benefit from ensuring free access to data and tools, and from
enhancing the synergy between multidisciplinary research, data
providers, and community-wide initiatives that critically benchmark and
evaluate progress in the field.
References
[1] Cheng, Y., Grigorieff, N., Penczek, P. A., Walz, T., A primer to
single-particle cryo-electron microscopy. Cell 2015, 161 ,
438-449.
[2] Afonine, P. V., Klaholz, B. P., Moriarty, N. W., Poon, B.
K., et al. , New tools for the analysis and validation of cryo-EM
maps and atomic models. Acta Crystallogr D Struct Biol 2018,74 , 814-840.
[3] de Oliveira, T. M., van Beek, L., Shilliday, F., Debreczeni, J.
E., Phillips, C., Cryo-EM: The Resolution Revolution and Drug Discovery.SLAS Discov 2021, 26 , 17-31.
[4] Baretic, D., Pollard, H. K., Fisher, D. I., Johnson, C.
M., et al. , Structures of closed and open conformations of
dimeric human ATM. Sci Adv 2017, 3 , e1700933.
[5] Zhong, E. D., Bepler, T., Berger, B., Davis, J. H., CryoDRGN:
reconstruction of heterogeneous cryo-EM structures using neural
networks. Nat Methods 2021, 18 , 176-185.
[6] Kinman, L. F., Powell, B. M., Zhong, E. D., Berger, B., Davis,
J. H., Uncovering structural ensembles from single-particle cryo-EM data
using cryoDRGN. Nat Protoc 2023, 18 , 319-339.
[7] Jumper, J., Evans, R., Pritzel, A., Green, T., et al. ,
Highly accurate protein structure prediction with AlphaFold.Nature 2021.
[8] Jumper, J., Evans, R., Pritzel, A., Green, T., et al. ,
Applying and improving AlphaFold at CASP14. Proteins 2021,89 , 1711-1721.
[9] Callaway, E., ’It will change everything’: DeepMind’s AI makes
gigantic leap in solving protein structures. Nature 2020,588 , 203-204.
[10] Akdel, M., Pires, D. E. V., Pardo, E. P., Janes, J., et
al. , A structural biology community assessment of AlphaFold2
applications. Nat Struct Mol Biol 2022, 29 , 1056-1067.
[11] Jumper, J., Hassabis, D., Protein structure predictions to
atomic accuracy with AlphaFold. Nature Methods 2022, 19 ,
11-12.
[12] Mirdita, M., Ovchinnikov, S., Steinegger, M., ColabFold -
Making protein folding accessible to all. bioRxiv 2021,
2021.2008.2015.456425.
[13] Varadi, M., Anyango, S., Deshpande, M., Nair, S., et
al. , AlphaFold Protein Structure Database: massively expanding the
structural coverage of protein-sequence space with high-accuracy models.Nucleic acids research 2022, 50 , D439-D444.
[14] UniProt, C., UniProt: the universal protein knowledgebase in
2021. Nucleic Acids Res 2021, 49 , D480-D489.
[15] Bordin, N., Dallago, C., Heinzinger, M., Kim, S., et
al. , Novel machine learning approaches revolutionize protein knowledge.Trends Biochem Sci 2023, 48 , 345-359.
[16] Ren, F., Ding, X., Zheng, M., Korzinkin, M., et al. ,
AlphaFold accelerates artificial intelligence powered drug discovery:
efficient discovery of a novel CDK20 small molecule inhibitor.Chem Sci 2023, 14 , 1443-1452.
[17] Varadi, M., Velankar, S., The impact of AlphaFold Protein
Structure Database on the fields of life sciences. Proteomics2022, e2200128.
[18] Fleishman, S. J., Horovitz, A., Extending the New Generation of
Structure Predictors to Account for Dynamics and Allostery. J Mol
Biol 2021, 433 , 167007.
[19] Perrakis, A., Sixma, T. K., AI revolutions in biology: The joys
and perils of AlphaFold. EMBO Rep 2021, 22 , e54046.
[20] Evans, R., O’Neill, M., Pritzel, A., Antropova, N., et
al. , Protein complex prediction with AlphaFold-Multimer. BioRxiv2021.
[21] Wodak, S. J., Vajda, S., Lensink, M. F., Kozakov, D., Bates, P.
A., Critical Assessment of Methods for Predicting the 3D Structure of
Proteins and Protein Complexes. Annu Rev Biophys 2023, 52 ,
183-206.
[22] Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., et
al. , Accurate prediction of protein structures and interactions using a
three-track neural network. Science 2021, 373 , 871-876.
[23] Kryshtafovych, A., Moult, J., Albrecht, R., Chang, G. A.,
et al. , Computational models in the service of X-ray and cryo-electron
microscopy structure determination. Proteins 2021, 89 ,
1633-1646.
[24] McCoy, A. J., Sammito, M. D., Read, R. J., Implications of
AlphaFold2 for crystallographic phasing by molecular replacement.Acta Crystallogr D Struct Biol 2022, 78 , 1-13.
[25] Mosalaganti, S., Obarska-Kosinska, A., Siggel, M., Turonova,
B., et al. , Artificial intelligence reveals nuclear pore
complexity. bioRxiv 2021.
[26] Blum, M., Chang, H. Y., Chuguransky, S., Grego, T., et
al. , The InterPro protein families and domains database: 20 years on.Nucleic Acids Res 2021, 49 , D344-D354.
[27] Mistry, J., Chuguransky, S., Williams, L., Qureshi, M.,
et al. , Pfam: The protein families database in 2021. Nucleic
Acids Res 2021, 49 , D412-D419.
[28] Iacobucci, C., Gotze, M., Sinz, A., Cross-linking/mass
spectrometry to get a closer view on protein interaction networks.Curr Opin Biotechnol 2020, 63 , 48-53.
[29] Bryant, P., Pozzati, G., Elofsson, A., Improved prediction of
protein-protein interactions using AlphaFold2. Nat Commun 2022,13 , 1265.
[30] Tompa, P., Davey, N. E., Gibson, T. J., Babu, M. M., A million
peptide motifs for the molecular biologist. Mol Cell 2014,55 , 161-169.
[31] Perez-Riverol, Y., Bai, J., Bandla, C., Garcia-Seisdedos,
D., et al. , The PRIDE database resources in 2022: a hub for mass
spectrometry-based proteomics evidences. Nucleic Acids Res 2022,50 , D543-D552.
[32] Dimitrova-Paternoga, L., Jagtap, P. K. A., Chen, P. C., Hennig,
J., Integrative Structural Biology of Protein-RNA Complexes.Structure 2020, 28 , 6-28.
[33] Christoffer, C., Kihara, D., Domain-Based Protein Docking with
Extremely Large Conformational Changes. J Mol Biol 2022,434 , 167820.
[34] Atilgan, A. R., Durell, S. R., Jernigan, R. L., Demirel, M.
C., et al. , Anisotropy of fluctuation dynamics of proteins with
an elastic network model. Biophys J 2001, 80 , 505-515.
[35] Roney, J. P., Ovchinnikov, S., State-of-the-Art Estimation of
Protein Model Accuracy Using AlphaFold. Phys Rev Lett 2022,129 , 238101.
[36] Dey, S., Prilusky, J., Levy, E. D., QSalignWeb: A Server to
Predict and Analyze Protein Quaternary Structure. Front Mol
Biosci 2021, 8 , 787510.
[37] Xu, Q., Dunbrack, R. L., Jr., ProtCID: a data resource for
structural information on protein interactions. Nat Commun 2020,11 , 711.
[38] Wodak, S. J., Vlasblom, J., Turinsky, A. L., Pu, S.,
Protein-protein interaction networks: the puzzling riches. Curr
Opin Struct Biol 2013, 23 , 941-953.
[39] Singh, R., Park, D., Xu, J., Hosur, R., Berger, B., Struct2Net:
a web service to predict protein-protein interactions using a
structure-based approach. Nucleic Acids Res 2010, 38 ,
W508-515.
[40] Petrey, D., Zhao, H., Trudeau, S. J., Murray, D., Honig, B.,
PrePPI: A Structure Informed Proteome-wide Database of Protein-Protein
Interactions. J Mol Biol 2023, 435 , 168052.
[41] Szklarczyk, D., Gable, A. L., Nastou, K. C., Lyon, D., et
al. , The STRING database in 2021: customizable protein-protein
networks, and functional characterization of user-uploaded
gene/measurement sets. Nucleic Acids Res 2021, 49 ,
D605-D612.