INTRODUCTION

The AlphaFold2 algorithm, developed by DeepMind, demonstrated remarkable performance in protein structure prediction in Critical Assessment of Structure Prediction (CASP) 14 [1-3]. This was followed by the development of the AlphaFold-Multimer algorithm [4], which can predict multimeric structures with high accuracy. (Hereafter, I refer to both AlphaFold2 and AlphaFold-Multimer as AF2 unless there is a specific need to differentiate them.) Other protein structure prediction programs have emerged following the success of AF2 [5-7]. However, AF2 has demonstrated performance comparable to, or even better than, that of the newer programs. Hence, optimizing AF2 was considered one of the most promising strategies for achieving the highest accuracy in protein structure prediction tasks.
Therefore, the challenges in CASP15 were as follows: (1) collecting a sufficient number of evolutionarily related sequences for input into AF2, and (2) improving the structures generated by AF2.
Protein structure prediction tools are known to perform poorly when only a limited number of evolutionarily related sequences is available. Although AF2 is less sensitive to this problem, it remains a concern [1,8]. The collection of evolutionarily related sequences is therefore a crucial step in the process. Utilizing large metagenomic databases is a prominent strategy for addressing this challenge [9]. Accordingly, in addition to the databases employed in the official AF2 pipeline, I used PZLAST [10,11] to collect more metagenomic sequences. Furthermore, an in-house database was constructed from NCBI assembly [12] data to obtain sequences with taxonomic information, because such information was considered necessary for predicting multimeric structures [4,13]. The nr database [14], a widely used, extensive collection of sequences, was also included and searched using a customized version of PSI-BLAST [15,16].
To accomplish the second objective, a deep learning model was constructed to improve the accuracy of the predicted structures. The underlying assumption was that AF2 (and other structure-prediction software using Multiple Sequence Alignments [MSAs]) requires MSAs for high-quality prediction but can, at the same time, be misled by them. For example, antibody complementarity-determining regions are sequence-specific; therefore, the amino acids in the MSA should not be considered for these regions. The details of this model are described in a separate paper [17]. Although the model was primarily designed to refine multimeric structures, it was also expected to be useful for refining monomeric structures because the underlying principles should be similar.
For the CASP15 project, I devised a semi-automatic pipeline to deal with several issues that needed to be rectified. For example, AF2 can handle up to approximately 2200 amino acids (aa) in my environment. Therefore, if a target contained more amino acids than this, its sequence was cut into smaller pieces for prediction. In addition, when conserved domains had many hits, the number of hits covering other regions was relatively small. In this case, sequences were sampled to flatten the MSA depth. Furthermore, many target-specific interventions were required because the targets were diverse, including mutated proteins and targets that required the prediction of ensemble structures.
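The two interventions described above, cutting long sequences and sampling hits to flatten the MSA depth, can be illustrated with a minimal Python sketch. This is not the pipeline used for CASP15; the text does not specify how the pieces were chosen or how the sampling was performed. The function names, the overlap between neighboring pieces, the per-column depth cap, and the representation of a hit as a (start, end) range on the query are all assumptions made for illustration.

```python
import random


def split_sequence(seq, max_len=2200, overlap=200):
    """Cut `seq` into windows of at most `max_len` residues.

    Neighboring windows share `overlap` residues (an assumption; the
    source only says that long sequences were cut into smaller pieces).
    Returns a list of (start_index, subsequence) tuples.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    if len(seq) <= max_len:
        return [(0, seq)]
    step = max_len - overlap
    pieces = []
    start = 0
    while True:
        end = min(start + max_len, len(seq))
        pieces.append((start, seq[start:end]))
        if end == len(seq):
            break
        start += step
    return pieces


def flatten_msa_depth(hits, query_len, max_depth, seed=0):
    """Sample hits so that over-represented regions stop accumulating depth.

    `hits` is a list of (start, end) aligned ranges on the query
    (0-based, end exclusive). Hits are visited in random order, and a hit
    is kept only if it covers at least one column whose depth is still
    below `max_depth`. Returns the sorted indices of the kept hits.
    """
    rng = random.Random(seed)
    order = list(range(len(hits)))
    rng.shuffle(order)
    depth = [0] * query_len
    kept = []
    for i in order:
        start, end = hits[i]
        if any(depth[col] < max_depth for col in range(start, end)):
            kept.append(i)
            for col in range(start, end):
                depth[col] += 1
    return sorted(kept)


# Hypothetical usage: a 5000-aa target and an MSA dominated by one domain.
pieces = split_sequence("A" * 5000)
domain_hits = [(0, 100)] * 50      # many hits on a conserved domain
other_hits = [(100, 200)] * 5      # few hits on the remaining region
kept = flatten_msa_depth(domain_hits + other_hits, query_len=200, max_depth=10)
```

In this toy case, the hits on the conserved domain are capped at the chosen depth, whereas all hits covering the sparsely populated region are retained. A hit that spans both a saturated and an unsaturated column is still kept, so the cap is approximate rather than strict.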
As a result, my team ranked third by GDT-TS and first by the Assessor’s formulae in the single-domain category, and tenth in the multimer category, which showed that my approach could achieve state-of-the-art performance. However, several problems resulted in poor predictions, as described in this manuscript.