ABSTRACT
The human predictor team PEZYFoldings ranked third by GDT-TS (first by
the assessor’s formulae) in the single-domain category and tenth in the
multimer category in CASP15. In this paper, I describe the exact method
used by PEZYFoldings in the competition.
As AlphaFold2 and AlphaFold-Multimer, developed by DeepMind, are
state-of-the-art structure prediction tools, it was assumed that
enhancing the input and output of the tools was an effective strategy to
obtain the highest accuracy for structure prediction. Therefore, I used
additional tools and databases to collect evolutionarily related
sequences and introduced a deep-learning-based model in the refinement
step. In addition to these modifications, manual interventions were
performed to address various tasks.
Detailed analyses were performed after the competition to identify the
main contributors to performance. Comparing the number of evolutionarily
related sequences I used with those of the other teams that provided
AlphaFold2’s baseline predictions revealed that an extensive sequence
similarity search was one of the main contributors. The impact of the
refinement model was minimal, though statistically significant
(p < 0.05 for the TM-score). In
addition, I obtained large Z-scores for the subunits of H1137, for
which I performed manual domain parsing that took the interfaces
between the subunits into account. This finding suggests that manual
intervention contributed to my performance.
Prediction performance was low when I could not identify
evolutionarily related sequences. T1130 is one such example, although
other teams were able to model better structures for it. According to
discussions at the CASP15 conference, the two teams that ranked higher
than PEZYFoldings had some hits for T1130. My lack of hits may be
because T1130 is a eukaryotic protein, whereas the additional databases
I used were derived mainly from metagenomic sequences, which consist
primarily of prokaryotic proteins.
These results highlight opportunities for improvement in 1) multimer
prediction, 2) building larger and more diverse databases, and 3)
developing tools to predict structures from primary sequences alone. In
addition, automating the manual intervention process remains a task for
future work.