Modeling results in the context of other CASP14 groups and automated model selection
The results presented in Table 1 and Supplementary Table S1 do not, by themselves, show how successful we were relative to other participants. To place our performance in the CASP14 context, we compared our results (group “Venclovas”) with those of three other top-performing groups, considering models designated as first (model 1). We also included our automatic scoring method (“VoroMQA-select-new”) as a virtual group, allowing it to select from all CASP14 multimeric models (produced by both automatic servers and human groups). In this way we aimed to test the effectiveness of our automatic model scoring method in the best possible scenario. For the performance comparison, we used the sums of z-scores for two interface accuracy measures (ICS and IPS) and two global structure accuracy measures (lDDT and TM-score) (Figure 2).
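The summation of z-scores described above can be sketched as follows. This is a minimal illustration, not the exact CASP14 assessment procedure: it assumes plain per-target z-scores computed over all participating groups with the population standard deviation, without any outlier removal or clipping. The function names and the input layout (`{target: {group: score}}`) are hypothetical.

```python
from statistics import mean, pstdev

def per_target_z_scores(scores_by_group):
    """Convert raw scores for one target ({group: score}) into z-scores."""
    values = list(scores_by_group.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All groups scored identically: no group stands out on this target.
        return {group: 0.0 for group in scores_by_group}
    return {group: (score - mu) / sigma
            for group, score in scores_by_group.items()}

def summed_z_scores(scores):
    """Sum per-target z-scores for each group.

    scores: {target: {group: score}} for a single measure (e.g. ICS).
    Assumption: a group absent from a target simply contributes nothing.
    """
    totals = {}
    for target_scores in scores.values():
        for group, z in per_target_z_scores(target_scores).items():
            totals[group] = totals.get(group, 0.0) + z
    return totals
```

Applying `summed_z_scores` separately to the ICS, IPS, lDDT and TM-score tables would give one ranking per measure, which is the kind of comparison summarized in Figure 2.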
The comparison revealed that different features of our models were predicted with different levels of success. In terms of the accuracy of intersubunit interfaces (ICS and IPS), we achieved the best results. In particular, we were successful in predicting interface patches (IPS), whereas the prediction of specific residue-residue contacts (ICS) was somewhat less successful. On the other hand, the global structure accuracy of our models was lower than that of other top-performing groups. This is especially visible for lDDT, an all-atom score that largely reflects the accuracy of individual subunits. Interestingly, our automatic model selection method showed relatively strong performance, ranking third by each of the four scores. Although this method performed worse than our human group on both interface accuracy measures and TM-score, its results according to all-atom accuracy (lDDT) were better.
To look at different features in more detail, we examined per-target z-scores. Z-score values were accumulated progressively over targets ordered by the best ICS achieved by any group, which can be interpreted as an estimate of target difficulty. Figure 3 shows the resulting plots for the models designated as first (model 1). In addition to the data for the same top groups and “VoroMQA-select-new”, the plots include the data for the best models provided by any predictor group. The latter curve serves as a reference, representing the upper limit of what could have been achieved in CASP14.
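The progressive accumulation described above can be sketched as follows. This is only an illustration under stated assumptions: the ordering direction (easiest target first, i.e. highest best ICS first) is a guess exposed as a parameter, a group with no model for a target is assumed to contribute zero, and all function names and data layouts are hypothetical.

```python
from itertools import accumulate

def order_targets_by_best_ics(ics, hardest_first=False):
    """Order targets by the best ICS achieved by any group.

    ics: {target: {group: ICS}}.
    Assumption: a higher best ICS means an easier target.
    """
    best = {target: max(by_group.values()) for target, by_group in ics.items()}
    return sorted(best, key=lambda t: best[t], reverse=not hardest_first)

def cumulative_z(z_scores, group, ordered_targets):
    """Running sum of one group's z-scores over the ordered targets.

    z_scores: {target: {group: z}}; missing entries count as 0.0 (assumption).
    """
    return list(accumulate(z_scores[t].get(group, 0.0) for t in ordered_targets))

def cumulative_best_z(z_scores, ordered_targets):
    """Reference curve: running sum of the best z-score by any group."""
    return list(accumulate(max(z_scores[t].values()) for t in ordered_targets))
```

Plotting the output of `cumulative_z` for each group together with `cumulative_best_z` would give curves of the kind shown in Figure 3, with the reference curve bounding all the others from above.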
Interestingly, the per-target analysis (Fig. 3) revealed that the relative success of different groups depended not only on the evaluation measure, as seen in Fig. 2, but also on the prediction targets. According to the interface prediction accuracy, our group dominated for most of the targets (Fig. 3A,B). For the global accuracy of models, however, the picture is different. According to TM-score (Fig. 3D), our models are below the state of the art for about half of the targets, whereas according to lDDT (Fig. 3C) this is true for nearly all the targets. To see whether our models as assessed by lDDT were indeed significantly inferior to those of other top groups, we examined the cumulative raw values (Supplementary Figure S3). Surprisingly, the absolute differences between the groups turned out to be relatively small, especially when evaluated using lDDT (Fig. S3F). This indicates that in most cases subunit structures were of comparable accuracy and that relatively large z-score differences resulted from small structural improvements. The same analysis performed with the CAD-score-based analogs of ICS, IPS and lDDT scores led to similar conclusions (Supplementary Figure S4).
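The observation that small raw differences can produce large z-score differences follows from the z-score being scale-invariant: it measures a difference relative to the spread of scores on a target, not in absolute terms. The toy example below uses made-up lDDT values (not CASP14 data) for three hypothetical groups to show this; it assumes plain z-scores with the population standard deviation.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Plain z-scores; assumes the values are not all identical."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical lDDT values for three groups on two targets (not CASP14 data).
tight_target = [0.80, 0.81, 0.82]  # groups differ by only 0.01 lDDT
wide_target = [0.40, 0.60, 0.80]   # groups differ by 0.20 lDDT

tight_z = z_scores(tight_target)
wide_z = z_scores(wide_target)

# Both targets yield the same z-scores for corresponding groups, even though
# the raw differences on the first target are 20 times smaller.
```

This is why the raw cumulative values are a useful complement to cumulative z-scores when judging whether a z-score gap reflects a substantial structural difference.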
In addition to individual scores, we analyzed their combinations reflecting either the interface prediction accuracy or the accuracy of both the interface and the global structure. We performed this analysis both for models designated as first (Fig. S5) and for the best-of-five models (Fig. S6). The analysis of these combinations further corroborated the above observations on our relative success in interface prediction and on target-dependent group performance. Interestingly, in the analysis of best-of-five models, our automatic selection method (VoroMQA-select-new) was the best according to the interface accuracy (Fig. S6A,C) and close to the top according to the combined accuracy (Fig. S6B,D). Although VoroMQA-select-new had an important advantage over other groups in having access to all the models, the results suggest that this automatic selection procedure is quite robust.