RESULTS AND DISCUSSIONS

Overall performance

My team, PEZYFoldings, placed third by GDT-TS (first by the Assessor’s formulae) in the single-domain category and tenth in the multimer category. When the ranking considers all submitted models, PEZYFoldings placed fourth by GDT-TS (first by the Assessor’s formulae) in the single-domain category and fourth in the multimer category. The improved ranking in the multimer category when all submitted models are considered suggests that there is room for improvement in my ranking and selection process for multimeric structures.
After the competition, all generated models, including unsubmitted ones, were assessed based on their TM-scores29 (Fig. S1, S2). Optimal chain mappings for the multimer targets were obtained using US-align34,35 or MM-align30. TM-scores were calculated using the TM-score software. Among the 93 single-domain targets with available ground truth structures, 10 targets had unsubmitted models that were better than the submitted models by a substantial TM-score margin (>0.1). Likewise, among the 36 multimer targets with available ground truth structures, three targets had unsubmitted models that were better than the submitted models by the same margin (>0.1). It is important to note that these results cannot be directly attributed to the inadequacy of my model ranking and selection procedure, as some models were not processed in the ranking and selection step. Instead, the results indicate the full potential of the structure prediction components.
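The comparison described above can be sketched as follows. This is a minimal illustration, not the analysis script used in this study: the function name, the dictionary layout, and the choice to compare against the best submitted model (rather than MODEL 1) are assumptions made for the example.

```python
def targets_with_better_unsubmitted(scores, submitted, margin=0.1):
    """Return targets whose best generated model beats the best submitted one.

    scores:    {target: {model_id: tm_score}} for all generated models
               (hypothetical layout).
    submitted: {target: [model_id, ...]} for the models sent to CASP.
    margin:    minimum TM-score difference regarded as substantial.
    """
    better = []
    for target, model_scores in scores.items():
        best_generated = max(model_scores.values())
        # Assumption: the reference is the best of the submitted models.
        best_submitted = max(model_scores[m] for m in submitted[target])
        if best_generated - best_submitted > margin:
            better.append(target)
    return sorted(better)
```

Applied to the per-target TM-scores, a function of this kind would list the targets for which an unsubmitted model exceeded the submitted ones by more than 0.1.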

Notable targets

In this section, I will discuss specific targets that are likely to be of particular interest to readers.

T1130

For T1130, I could obtain hits only from the MGnify26 database. However, the identities between the hits and the query were low. In addition, the target was described as an aphid protein; therefore, the hits were suspicious. Furthermore, the confidence scores of the resulting models were poor. I performed de novo-like structure prediction using the refinement model and built approximately 5,000 models. Structures with relatively high self-confidence scores were submitted; however, these scores were lower than those of the usual targets. The plDDT of the top structure was 68.99, which did not reach the level often observed in successful predictions, which tend to have a plDDT above 80. Assessment after the competition showed that all the produced structures had an insufficient TM-score (Fig. S1). According to discussions during the CASP15 conference, the two teams with the highest ranks in the single-domain category had hits for T1130. Therefore, my poor performance on this target was caused by a deficiency in the sequence similarity search conditions.

H1137

Due to the large size of H1137, domain parsing was performed through visual inspection. Initially, the features derived from the MSA were divided into multiple segments and concatenated, yielding inputs of approximately 1000-2000 amino acids in total, from which partial structures were predicted (step 1, Table S3). Utilizing the outcomes of step 1, I constructed additional partial structures (step 1.5, Table S3) to verify the accuracy of my assumptions regarding subunit interactions. These predictions suggested that the N-terminal regions of s1-s6 interacted with s8 and s9, while s7 interacted with s8 and s9. Consequently, I constructed partial structures using: 1) N-terminal regions of s1-s6, full-length s8, and full-length s9; 2) middle part of s1-s6; 3) C-terminal regions of s1-s6; 4) N-terminal regions of s1-s6, N-terminal regions of s7, full-length s8, and full-length s9; 5) full-length s7; and 6) GFP domain of s7 (step 2, Table S3). The predicted partial structures in step 2 were concatenated, and the subunit structures were extracted. Note that, because complete structures of H1137 were intended to be built at the H1137 submission date, the structures submitted as independent subunit structures were taken from partially concatenated structures.
Similar to the usual monomer targets, the sum of plDDTs was used as the selection criterion. The refinement was not performed because the performance of the refinement model on a partial structure was considered poor and the entire assembly structure was too large to process.
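The selection criterion can be illustrated with a short sketch. It assumes that plDDT is stored in the B-factor column of the PDB file, as AlphaFold2 output does, and that one value per residue is taken from the CA atom; the function names and the `models` dictionary are hypothetical and are not part of the PEZYFoldings pipeline.

```python
def sum_plddt(pdb_text):
    """Sum the per-residue plDDT read from CA atoms of a PDB-format string."""
    total = 0.0
    for line in pdb_text.splitlines():
        # Fixed-column PDB format: atom name in columns 13-16,
        # B-factor (assumed to hold plDDT) in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            total += float(line[60:66])
    return total


def pick_best(models):
    """Return the name of the model with the largest summed plDDT.

    models: {model_name: pdb_text} (hypothetical layout).
    """
    return max(models, key=lambda name: sum_plddt(models[name]))
```

For models of the same sequence, ranking by the sum is equivalent to ranking by the mean plDDT, since the residue count is identical.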
According to the single-domain category results in CASP15, I achieved Z-scores greater than 2.0 for six targets. Five of these six targets were the helical domains of H1137 subunits (D2 domains of s1, s2, s3, s4, and s5), which were challenging to predict as monomers. Hence, domain parsing that considers the interface was essential for my high performance.

T1173-D2

Regarding T1173, the semi-automatic protocol did not yield structures that displayed promising results when compared to the ColabFold32 results (Fig. 2C, first panel and fourth panel). A Quick BLASTP15,36,37 search showed that T1173 is part of a longer sequence. Therefore, I extended the query by 196 aa at the N-terminus using the longer sequence and predicted the structures again. After constructing several structures, I noticed that the quality of the C-terminal region (based on visual inspection, Fig. 2C, second panel) was inferior to that of the ColabFold result (Fig. 2C, fourth panel). I examined the depth of the MSA and observed a highly skewed distribution; the deepest part had more than 400,000 sequences, while the last ten aa region had fewer than 1,000 sequences (Fig. 2A). Consequently, I selected hits in the final ten aa, randomly selected 500 sequences from the original MSA, and flattened their distribution (Figs. 2A, 2B). The resulting models appeared satisfactory (Fig. 2C, third panel). Nevertheless, it is important to note that other groups submitted more accurate structures. The N-terminal region of D2 (positions 63-113 in the full-length sequence) in my MODEL 1 could not be aligned with the ground truth structure using the TM-score software. Better structures might have been achieved if I had included more sequences covering positions 63-113.
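The depth inspection and flattening step can be sketched as below. The exact procedure is not specified beyond the description above, so this is one plausible reading: keep every hit that covers the last ten columns, add a random sample of the original hits, and drop duplicates. The function names, the list-of-strings MSA layout (query first, `-` for gaps), and the fixed random seed are assumptions made for the example.

```python
import random


def column_depth(msa):
    """Count non-gap residues in each alignment column."""
    return [sum(row[i] != "-" for row in msa) for i in range(len(msa[0]))]


def flatten_msa(msa, tail=10, n_random=500, seed=0):
    """Keep the query, all hits covering the last `tail` columns,
    and a random sample of `n_random` hits from the original MSA."""
    hit_idx = range(1, len(msa))  # index 0 is the query
    tail_idx = {i for i in hit_idx
                if any(c != "-" for c in msa[i][-tail:])}
    rng = random.Random(seed)
    sampled = set(rng.sample(hit_idx, min(n_random, len(hit_idx))))
    keep = sorted(tail_idx | sampled)  # preserve the original hit order
    return [msa[0]] + [msa[i] for i in keep]
```

Comparing `column_depth` before and after `flatten_msa` shows whether the ratio between the deepest and shallowest columns has been reduced, which is the property plotted in Figs. 2A and 2B.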

Assessment of the impact of individual elements

Impact of the extended sequence similarity search

To examine the impact of the extended sequence similarity search process, I conducted an assessment of predictions after the competition, focusing on the differences in input MSAs. I employed two types of MSAs for predicting target structures. The first MSA set comprised MSAs utilized by PEZYFoldings (PEZY-MSA), while the second set was generated using the default settings of the AlphaFold2 or AlphaFold-Multimer pipeline provided by the NBIS-AF2-standard and NBIS-AF2-multimer teams (NBIS-MSA). I examined targets with a total length of 1200 aa or less. However, out-of-memory errors occurred for T1124, T1132, and T1174. Consequently, 54 single-domain targets were investigated. PEZY-MSA had at least one more sequence than NBIS-MSA, except for the targets T1133-D1, T1131-D1, T1122-D1, and T1119-D1 (Table S4). Thus, I can confirm that I obtained more evolutionarily related sequences than the default settings in over 90% of cases. The number of sequences in PEZY-MSA for specific targets could be smaller than that in NBIS-MSA due to: 1) running hhblits21 against UniRef3023 and BFD24 separately; 2) not using the UniRef9038 database; 3) reducing the number of iterations against BFD from three to two; 4) using a more stringent e-value (0.00001) for jackhmmer19,20 compared to the default settings (0.0001); and 5) applying hhfilter22 to hits from BFD and MGnify. When clustering sequences with a sequence identity threshold of 62%, a criterion for effective sequence counts used in previous studies13,39, I obtained larger values than the default settings in 43 of the 54 cases (Table S4).
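As a rough illustration of counting effective sequences at a 62% identity threshold, the sketch below uses a simple greedy clustering. The clustering tool and the identity definition actually used are not stated in this section, so both are assumptions here: identity is computed over columns where neither sequence has a gap, and each sequence joins the first existing representative it matches.

```python
def identity(a, b):
    """Fraction of identical residues over columns where both are non-gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def count_clusters(msa, threshold=0.62):
    """Greedy clustering: a sequence founds a new cluster unless it
    reaches `threshold` identity with an existing representative."""
    representatives = []
    for seq in msa:
        if not any(identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return len(representatives)
```

This greedy pass is quadratic in the worst case and depends on input order, so it is suitable only for small alignments; for MSAs of the size discussed here a dedicated clustering tool would be needed.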
The ΔTM-score (TM-score of structures with PEZY-MSA minus the TM-score of structures with NBIS-MSA) as a function of Nseq-NBIS-MSA (number of sequences in NBIS-MSA) is illustrated in Figs. 3C and 3D. Seven and five targets demonstrated a ΔTM-score >0.05 for MODEL 1 (the model with the highest confidence) and the best model among the five generated models, respectively (Figs. 3C, 3D). All targets with a ΔTM-score >0.05 had an Nseq-NBIS-MSA of less than 1000. The ΔTM-score for targets with Nseq-NBIS-MSA greater than 1000 was minimal, which is consistent with the results in the original publication1; the quality of predictions by AlphaFold2 increases until the number of sequences or Neff is approximately 100-1000. This trend was also observed in the CASP15 results. Among the 53 targets, I had nine targets with a Z-score greater than 1.0, and seven out of those nine targets had an Nseq-NBIS-MSA of less than 1000 (Figure 4E, Table S4).

Impact of the deep-learning-based refinement model

The TM-scores of the models submitted to the competition website were collected to investigate the refinement model’s effectiveness. The TM-scores before and after the last refinement are summarized in Fig. 4. Models subjected to docking or de novo-like structure predictions were excluded. The refinement model improved the quality of some predicted structures (Figs. 4A, 4D); however, from the point of view of the performance in the competition, the differences in the TM-score were indistinguishable (Figs. 4B, 4C, 4E, 4F). In other words, although the refined structures had better accuracy than the original structures, the other structures achieved the same or better levels of accuracy without refinement. In CASP15, there were seven conventional antibody-antigen or nanobody-antigen targets. For three of these seven targets, I built models with an average DockQ40 score >0.49, which meets the medium-quality threshold in the CAPRI41 criteria. As mentioned in the introduction, the refinement model was anticipated to perform well with antibodies. However, the results obtained from the model indicate that further efforts are required to reach the desired level of success.
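For reference, the mapping from a DockQ score to a CAPRI-style quality class can be sketched as below. The cut-offs 0.23, 0.49, and 0.80 are the commonly cited DockQ thresholds; the function names and the per-target dictionary are hypothetical and do not correspond to a script used in this study.

```python
def capri_class(dockq):
    """Map a DockQ score to a CAPRI-style quality label."""
    if dockq >= 0.80:
        return "high"
    if dockq >= 0.49:
        return "medium"
    if dockq >= 0.23:
        return "acceptable"
    return "incorrect"


def targets_at_least_medium(avg_dockq):
    """avg_dockq: {target: average DockQ over submitted models}
    (hypothetical layout). Return targets classed medium or high."""
    return sorted(t for t, score in avg_dockq.items()
                  if capri_class(score) in ("medium", "high"))
```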