MATERIALS AND METHODS
PEZYFoldings in CASP15
Overall pipeline
A schematic representation of the pipeline is shown in Fig. 1A. The
default AF2 pipeline can be broadly divided into the following
steps: MSA construction, structure prediction, and relaxation using
OpenMM18. The main differences between the default AF2
pipeline and the PEZYFoldings pipeline include a more extensive sequence
similarity search in the MSA construction step and the introduction of
refinement steps. Details of each step are described in the following
sections.
Sequence similarity search and MSA
construction
The MSAs constructed in the pipeline are summarized below. In addition,
the URLs and data downloaded from the databases are listed in Table S1.
PZLAST-MSA : Query sequences were submitted to the
PZLAST10,11 web API service with option
“max_out=10000.” Because hits from PZLAST are fragmented sequences
directly translated from sequencer reads, I aligned them with jackhmmer19,20 and assembled them using a simple script; if
aligned regions of two sequences were longer than 20 aa and the regions
had an identity > 95 %, the sequences were merged.
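As an illustration, the fragment-merging rule above could be sketched as follows. The aligned-row representation and function names are my assumptions for illustration, not the actual assembly script:

```python
def percent_identity(a: str, b: str) -> float:
    """Identity (%) over the non-gap overlap of two equal-length aligned rows."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def try_merge(a: str, b: str, min_overlap: int = 20, min_ident: float = 95.0):
    """Merge two aligned fragments if their overlap exceeds 20 aa at >95 % identity.

    Returns the merged row (preferring residues over gaps), or None if the
    fragments do not satisfy the merge criteria.
    """
    overlap = sum(1 for x, y in zip(a, b) if x != "-" and y != "-")
    if overlap <= min_overlap or percent_identity(a, b) <= min_ident:
        return None
    return "".join(y if x == "-" else x for x, y in zip(a, b))
```

Repeatedly applying such a merge over all fragment pairs reconstructs longer sequences from the read-derived PZLAST hits.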
PSIBLAST-MSA : PZLAST-MSA was input to
PSI-BLAST15 version 2.13.0 with the
PSI-BLASTexB16 customization, using the -in_msa option
and options “-evalue 0.00001 -outfmt \”6 qseqid sallacc
evalue pident nident qlen staxids sseq\”
-max_target_seqs 100000 -num_threads 128.” The nr database14
and an in-house metagenomic database (described later) were searched
simultaneously. In the early season, the number of iterations was set to
two. In the later season (from T1173), it was changed to search
iteratively up to three times using Position Specific Scoring Matrix
(PSSM) checkpoint files; when the number of hit sequences was greater
than 10,000, the iteration was terminated. If the final number of hit
sequences was small (<10,000), PZLAST-MSA was merged.
Sequences were aligned using jackhmmer. The taxonomy IDs of the
sequences were added to a TaxID tag, which was used for sequence pairing
in a later step.
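The later-season iteration logic described above (up to three PSSM-restarted rounds, early termination on a deep hit set, and the PZLAST-MSA fallback for shallow results) can be summarized as a control-flow sketch. `run_iteration` is a hypothetical stand-in for one PSI-BLAST round restarted from the previous round's PSSM checkpoint:

```python
def iterative_search(run_iteration, pzlast_msa, max_iters=3, enough=10_000):
    """Sketch of the later-season iterative PSI-BLAST search.

    run_iteration(i) stands in for one search round (restarted from the
    previous PSSM checkpoint) and returns the hit sequences for round i.
    """
    hits = []
    for i in range(max_iters):
        hits = run_iteration(i)
        if len(hits) > enough:      # deep enough: stop iterating
            break
    if len(hits) < enough:          # shallow result: merge in PZLAST-MSA
        hits = hits + pzlast_msa
    return hits
```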
HHBLITS-UNIREF-MSA : PSIBLAST-MSA was input to
hhblits21 (hhsuite22 v3.3.0) using
the UniRef3023 database with options ”-all -n 3 -cpu
128.”
HHBLITS-BFD-MSA : PSIBLAST-MSA was input to hhblits using the
BFD24 database with options “-all -n 2 -cpu 6.” If
the number of sequences in the MSA was larger than 10,000, the MSA was
filtered using hhfilter with options “-cov 30 -id 100 -diff 10000.”
JACKHMMER-UNIPROT-MSA : A query sequence was input to
jackhmmer (hmmer19,20 suite 3.3.2) using the
UniProt25 database with options “--cpu 128 -E
0.00001 -N 3.”
JACKHMMER-MGNIFY-MSA : A query sequence was input to
jackhmmer using the MGnify26 database with options
“--cpu 128 -E 0.00001 -N 3.” If the number of sequences in the MSA
was larger than 10,000, the MSA was filtered using hhfilter with options
“-cov 30 -id 100 -diff 10000.”
Final input MSA : PSIBLAST-MSA, HHBLITS-UNIREF-MSA,
HHBLITS-BFD-MSA, JACKHMMER-UNIPROT-MSA, and JACKHMMER-MGNIFY-MSA were
concatenated and filtered using hhfilter with options “-id 100 -cov 30
-maxseq 500000.”
Construction procedure of the in-house metagenomic
database
The metadata of the assembly entries was downloaded from the NCBI FTP
site on 2022-03-28. The entries that had “metagenome” in their
description were extracted. The entries were checked to see whether they
had translated_cds.faa, protein.faa.gz, cds_from_genomic.fna.gz,
rna_from_genomic.fna.gz, or genomic.fna.gz in this order of priority.
If the sequence data were nucleotides, they were translated using
prodigal27 with the default settings. If prodigal failed with the
default settings, the “-p meta” option was used. A unique ID was
generated for each entry and considered
a taxonomy ID.
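The file-priority logic above can be sketched as a small helper. The file names follow the NCBI assembly layout described in the text; the function names are assumptions for illustration:

```python
# Priority order of sequence files within each NCBI assembly entry,
# as described above: protein files first, then nucleotide files.
PRIORITY = [
    "translated_cds.faa",
    "protein.faa.gz",
    "cds_from_genomic.fna.gz",
    "rna_from_genomic.fna.gz",
    "genomic.fna.gz",
]

def pick_sequence_file(available):
    """Return the highest-priority sequence file an assembly entry provides."""
    names = set(available)
    for name in PRIORITY:
        if name in names:
            return name
    return None

def needs_translation(name):
    """Nucleotide files (.fna) must be gene-called and translated with prodigal."""
    return ".fna" in name
```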
MSA filtering and feature
building
After constructing the MSAs, I filtered them using several criteria and
created variations of the MSAs according to the sequence identities: 1)
clustered with sequence identity 95 %, 2) clustered with sequence
identity 90 %, 3) filtered out if sequence identity with the query was
less than 80 %, 4) filtered out if sequence identity with the query was
less than 60 %, and 5) no identity filters were applied. Filtering was
performed using hhfilter. I used the “-cov 30” option; however, in the
middle of the season, I noticed that all unpaired sequences of a subunit
were filtered out if the subunit length was less than 30 % of the total
length of the multimeric structures. Therefore, the coverage value was
adjusted on a case-by-case basis during the season. The input features for the AF2
networks are created in this step. This step allows flexible
manipulation of the input features for AF2; for example, one can
deliberately pair or unpair sequences, as in
AF2Complex28, and provide sparse residue indices to
generate partial structures. I added extra gaps (the residue index)
between subunits to predict multimer structures with the monomer version
of AF2 13,28. TaxID tags or OX tags in the headers of
the FASTA entries were used to pair sequences in the MSAs. TaxID tags
were added to the headers of the sequences extracted from the nr and
in-house metagenomic database. When sufficient computational resources
were available, features were also created by skipping the pairing
step. For antibody-antigen complexes, the pairing step was always skipped
(the sequences for H1140 were paired because of my error). In addition,
I provided a3m files to the official feature-building pipeline and
created input features for the network considering the possibility that
I had bugs in my scripts.
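The taxonomy-based pairing idea can be illustrated with a minimal sketch. The one-sequence-per-taxid dictionary representation and the function name are simplifying assumptions; the real pipeline can pair multiple sequences per organism:

```python
def pair_by_taxid(msas):
    """Pair one sequence per subunit MSA by shared taxonomy ID (sketch).

    msas: one dict per subunit mapping taxid -> a representative sequence.
    Only taxids present in every subunit's MSA yield a paired row, which is
    the core idea behind cross-chain MSA pairing.
    """
    shared = set(msas[0])
    for msa in msas[1:]:
        shared &= set(msa)
    return [tuple(msa[tid] for msa in msas) for tid in sorted(shared)]
```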
Structure prediction by AlphaFold2 or
AlphaFold-Multimer
Predictions were made with the standard AlphaFold2 parameters
(model_1~5) and the AlphaFold-Multimer parameters
(model_1~5_multimer_v2) downloaded from
https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar
on 2022-03-11. The number of recycling steps was typically set from 5 to
30, considering the time and computer resources. Intermediate structures
were produced during recycling. Therefore, the pipeline produced
approximately 1000-2000 structures in standard cases.
Model ranking and
selection
Models were ranked and selected using the self-confidence metrics
generated by AF2. For monomer targets, the sum of per-residue pLDDT
values greater than 70 was used, because disordered regions were
possible. For multimer targets, the
weighted sum of the predicted TM-score29 (iptm × 0.8 +
ptm × 0.2)4 was used. When I predicted multimer
structures with the monomer version, as it did not produce multimer
metrics, all the unrelaxed structures were processed with the refinement
model (see below). The top-ranked models were typically selected. For
the rest of the submissions, the TM-score software or MM-align30
was used to maintain variation among the structures (e.g., highly
similar structures were not selected), considering ensembles,
alternative forms, or mispredictions. Various human interventions were
utilized in this step due to numerous issues that needed to be
addressed. For example, models in which subunits did not interact with
other subunits often had low TM-scores with other models and were
selected in the semi-automatic pipeline. However, such models were
avoided, as it was evident that the prediction was incomplete.
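The two ranking criteria above can be written compactly. This is a sketch of the scoring only; function names are my illustration:

```python
def monomer_score(plddts):
    """Monomer ranking score: sum of per-residue pLDDT values above 70,
    so that possibly disordered (low-confidence) residues do not count."""
    return sum(p for p in plddts if p > 70.0)

def multimer_score(iptm, ptm):
    """Multimer ranking score: weighted predicted TM-score,
    0.8 * ipTM + 0.2 * pTM."""
    return 0.8 * iptm + 0.2 * ptm
```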
Refinement
I constructed a deep-learning model that refined the predicted
structures by fine-tuning the official AlphaFold-Multimer weight
(model_1_multimer_v2). It takes a predicted structure and its amino
acid sequence as input and outputs a refined structure. Further
details on this model are provided in an independent
paper17. The training conditions employed for the
model used in CASP15 are listed in Table S2. The five structures
selected as submission candidates were input into the refinement model.
When sufficient time and resources were available, all predicted
structures except the intermediate ones were fed into the refinement
model.
Manual interventions
Domain parsing
Structures were usually predicted using full-length sequences of all
subunits. However, when the total number of amino acids was too large
to be handled on my GPU, I performed domain parsing and MSA cropping,
or predicted the entire structure in CPU mode. Domain parsing was
divided into several steps. First, the sequences were split into
fragments of 500-1000 aa, with boundaries chosen by an initial rough
guess. In addition, I sometimes used the results of domain
prediction using SMART31. Next, the structures were
predicted using AF2, and the regions or subunits that interacted with
one another were visually inspected. Then, I decided on new boundaries to avoid
disturbing the interface. Subsequently, the structures were predicted
again, and the resulting models were assessed. The boundary decision and
partial structure building steps were repeated until the quality of the
partial structures was satisfactory. They were then concatenated with
simple scripts, which performed structural alignment using the
overlapped regions.
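Concatenating partial models by structural alignment on the overlapped regions could look roughly like this minimal numpy sketch. The Kabsch superposition and the function names are my illustration under the assumption that overlaps are matched residue-by-residue on CA coordinates; the actual scripts are not published here:

```python
import numpy as np

def kabsch(P, Q):
    """Optimal rotation R and translation t such that R @ p + t maps P onto Q."""
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp

def stitch(model_a, model_b, overlap_a, overlap_b):
    """Superpose model_b onto model_a using their overlapped CA atoms,
    then concatenate, keeping model_a's copy of the overlap region."""
    R, t = kabsch(model_b[overlap_b], model_a[overlap_a])
    b_aligned = model_b @ R.T + t
    keep = np.setdiff1d(np.arange(len(model_b)), overlap_b)
    return np.vstack([model_a, b_aligned[keep]])
```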
MSA depth arrangement
In cases where the targets encompassed markedly conserved domains, the
resulting MSAs sometimes displayed considerable depth imbalances (Fig.
1B). If the depth of the MSA was highly skewed, sequences with amino
acids in the shallow regions were retained, and other sequences were
randomly subsampled to flatten the depth (Fig. 1C). When the depth was
insufficient, additional searches were performed to obtain more
sequences around the shallow regions.
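The flattening step can be sketched as a subsampling rule. The column-index representation of the shallow regions and the function name are my assumptions for illustration:

```python
import random

def flatten_depth(seqs, shallow_cols, target_depth, seed=0):
    """Subsample an MSA so that deeply covered regions do not dominate.

    Sequences with residues in the shallow columns are always retained;
    the remaining sequences are randomly subsampled until roughly
    target_depth sequences are kept.
    """
    rng = random.Random(seed)
    keep, rest = [], []
    for s in seqs:
        (keep if any(s[c] != "-" for c in shallow_cols) else rest).append(s)
    n_extra = max(0, target_depth - len(keep))
    return keep + rng.sample(rest, min(n_extra, len(rest)))
```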
Visual inspections of the refined
structures
Because the refinement model was trained with globular
proteins17, it sometimes produced overly globular structures
(Fig. 1D, 1E) or structures with many atomic clashes. Therefore, I visually inspected the
models, and if I observed any problems in the refined models, I did not
use them.
Comparison with other teams’
models
As the ColabFold32, NBIS-AF2-standard, and
NBIS-AF2-multimer teams provided publicly available prediction results, I
compared their models with my models and assessed whether the protocols
worked well. If I perceived my model’s quality as inferior to that of
other teams, I undertook protocol revision by conducting extra sequence
similarity searches or augmenting the number of recycling steps.
Docking or de novo-like structure
prediction by the refinement
model
When I could not build good structures using my basic pipeline, I
performed docking or de novo -like structure prediction using the
refinement model. The process for this approach was straightforward.
When the predicted chains were randomly moved and fed into the
refinement model, the model created complexes from the chains.
Similarly, feeding a structure with randomly placed atoms into the
refinement model resulted in a reasonable structure.
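The random-placement step preceding refinement can be sketched as a rigid per-chain translation. The coordinate representation and function name are my assumptions; the real inputs are full AF2 structure features:

```python
import random

def scatter_chains(chains, box=100.0, seed=0):
    """Randomly translate each chain before feeding it to the refinement model.

    chains: a list of chains, each a list of (x, y, z) atom coordinates.
    Each chain is rigidly moved by a random offset within a cubic box, so the
    refinement model must rebuild the inter-chain arrangement itself.
    """
    rng = random.Random(seed)
    moved = []
    for atoms in chains:
        dx, dy, dz = (rng.uniform(-box, box) for _ in range(3))
        moved.append([(x + dx, y + dy, z + dz) for x, y, z in atoms])
    return moved
```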
Target-specific process
Some other interventions such as point mutations on T1109 were
conducted. A concise summary of target-specific processes can be found
in Supplementary Text 1.
Assessment of the impact of individual
elements
Impact of extended sequence similarity
search
To investigate the impact of the MSA construction protocol without
manual intervention, I compared the MSAs with the baseline MSAs
generated using the default settings of the AF2 pipeline. Targets less
than or equal to 1,200 aa were considered because long sequences require
manual intervention to avoid out-of-memory errors. Baseline MSAs
provided by the NBIS-AF2-standard and NBIS-AF2-multimer teams were
downloaded from http://duffman.it.liu.se/casp15 on 2022-12-27. The
subunits of the assembly targets were predicted using AlphaFold-Multimer,
as were the assembly entries. The number of sequences (Nseq) in each MSA
was calculated as the number of clusters obtained with cd-hit33
using the options “-c 1.0 -G 0 -n 5 -aS 0.9 -M 64000 -T 8.” Feature
building was performed without identity filtering.
predicted using AF2 by setting the number of recycling steps to 15.
Z-M1-GDT (Z-scores of MODEL 1 based on GDT-TS) were extracted from TSV
files downloaded from the CASP15 website.
Impact of the refinement
model
To evaluate the effect of the refinement model, the accuracy of the
intermediate structures was compared before and after refinement. The
intermediate structures of the submitted models were collected from
backup files.