1 Introduction

The latest round of the Critical Assessment of Protein Structure Prediction experiment (CASP15), held in 2022, introduced a novel category for protein-ligand interaction prediction (CASP-PLI), aiming to evaluate cutting-edge methodologies on a blind target set of experimentally resolved complexes. In contrast to typical ligand docking benchmark experiments such as Teach Discover Treat (TDT)1, Continuous Evaluation of Ligand Prediction Performance (CELPP)2, Drug Discovery Data Resource (D3R)3–6, or Community Structure-Activity Resource (CSAR)7,8, the prediction task in CASP consisted of predicting both the structure of the receptor protein and the position and conformation of the ligand, hereafter referred to as protein-ligand complex (PLC) prediction. The evaluation results of this experiment are presented elsewhere in this issue9, as are the technical details and challenges encountered during the establishment of the new category as part of CASP10. These challenges include (1) PLC with incomplete ligands or of suboptimal quality for use as ground-truth ligand poses, (2) the need for extensive manual verification of data input and prediction output, and (3) the lack of suitable scoring metrics that consider both protein structure and ligand pose prediction accuracy, which necessitated the development of novel scores.
By integrating the insights and developments from the CASP-PLI experiment, automated systems for the continuous benchmarking of combined PLC prediction can be established. We discuss challenges and insights associated with the development of two complementary approaches to PLC benchmarking: a continuous evaluation of newly released PLC in the Protein Data Bank (PDB)11, as implemented in Continuous Automated Model EvaluatiOn (CAMEO, https://beta.cameo3d.org/)12, and a comprehensive evaluation of PLC prediction tools based on a diverse, curated, and annotated benchmark dataset of PLC.
CAMEO is a benchmarking platform conducting fully automated blind evaluations of three-dimensional protein structure prediction servers, based on the weekly prerelease of sequences for structures to be published in the upcoming release of the Protein Data Bank13–15. Since 2012, the 3D structure prediction category has assessed the accuracy of single-chain predictions. Additional assessment categories have been implemented over time to serve the structural bioinformatics community, in particular around the assessment of quality estimates (QE). Recently, efforts have been made towards the assessment of protein-protein complexes (quaternary structures) and protein-ligand pose prediction12.
While CAMEO allows for continuous validation of newly developed methods, it depends on the distribution of PLC released in the PDB in a given period. Thus, CAMEO evaluation over a given time period may not be representative of the entire PLC space, and method developers may not have immediate access to problem cases or to specific sets of PLC where their algorithm under- or overperforms. This suggests a second, complementary angle to automated benchmarking, namely the creation of a diverse dataset of PLC with representative complexes from across protein-ligand space, which would allow both global comparative scoring and the pinpointing of cases that method developers would need to address to improve their global performance. While many recent deep-learning docking methods train and validate their approaches on the time-split PDBBind set16 of PLC (in which 363 protein-ligand pockets are used for benchmarking), we demonstrate that this approach has shortcomings arising from the lack of crystal structure quality verification and the lack of consistent redundancy removal.
Previous research has shown that the quality of experimentally resolved structures can vary significantly17. Efforts have been made to establish criteria for assessing the quality of such structures, such as the Iridium criteria18. Comparing prediction results to lower-quality structures can skew the perception of their performance, an especially important consideration when assessing deep learning-based tools, which have been trained to reproduce results seen in experimentally resolved structures. Additionally, many crystal structures with ligands contain missing atoms or missing residues in the binding site, complicating their use as ground truth.
Even in the era of deep learning, determining the difficulty of predicting a PLC still relies, to some degree, on previously experimentally resolved structures. This was exemplified in this year's CASP-PLI results9, where template-based docking methods outperformed others due to the availability of previously solved, highly similar PLC for many of the targets. Thus, incorporating the novelty of a PLC into automated benchmarking setups is crucial for a fair and comprehensive evaluation. For CAMEO, this consists of filtering out "easy" targets based on sequence and ligand information available in the PDB prerelease. For the generation of a representative benchmark set, one can additionally consider the novelty of the binding site and ligand pose on a structural level.
Proteins are inherently flexible, exhibiting a range of conformations in line with their functions. Not every observed conformation is compatible with ligand binding, and this can significantly impact the accuracy of docking predictions even when high-quality experimentally resolved structures are used19,20. These factors are further complicated by the use of computationally predicted protein structures, as previous studies indicate that even state-of-the-art structure prediction methods are not always suited for the task of ligand docking, due to inaccuracies in conformations and side-chain positioning21. Moreover, some ligands have highly flexible regions that mainly interact with the solvent; evaluating the conformation of such a flexible part may be less meaningful than evaluating the parts of the ligand that form crucial interactions with protein residues. Thus, it is necessary to develop and employ evaluation metrics that extend beyond rigid ligand pose assessments.