Figure 2: Protein-ligand complexes (PLC) released per year (in
brown and orange) and those passing the relaxed quality criteria (in
green and blue), divided according to sequence identity to PLC seen in
previous years. The left two bars of each year (in brown and green) are
PLC with ligand combinations which differ from previous PLC, and the
right two bars (in orange and blue) are PLC containing the same set of
ligands as a matching PLC at that sequence identity.
This approach can be used in CAMEO to select the set of PLC to send out
for prediction, sacrificing few PLC while ensuring that predictors do
not waste resources on previously seen PLC or those with very similar
templates. However, the approach has some shortcomings, mainly due to
the limited information available to CAMEO when selecting targets,
namely the unique protein sequences and the chemical identities of the
ligands.
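As an illustration, a selection rule based on these two pieces of information can be sketched in a few lines. The cutoff value and the exact logic below are hypothetical and are not the actual CAMEO implementation:

```python
def is_novel(seq_id_to_best_match, ligand_set, seen_ligand_sets, seq_cutoff=0.7):
    """Hypothetical CAMEO-style pre-filter sketch.

    A PLC is sent out for prediction unless a previously seen PLC is both
    sequence-similar (identity >= seq_cutoff, an assumed threshold) and
    carries the same set of ligand chemical identities.
    """
    return (seq_id_to_best_match < seq_cutoff
            or frozenset(ligand_set) not in seen_ligand_sets)
```

This captures only the information the text says is available at selection time (sequences and ligand identities), which is exactly why the shortcomings discussed next arise.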
First, highly redundant regions or pockets in a PLC might be classified
as novel merely because other, genuinely novel pockets are present
elsewhere in the complex. Second, small-molecule binding poses can vary
significantly even for the same or very similar chemical compounds and
within the same protein, due to different protein conformations or a
small number of mutations in crucial binding regions. This cannot be
accounted for in the CAMEO pre-filtering step, but it is useful
information for evaluation and essential for representative dataset
creation.
Therefore, utilizing structure and binding pocket clustering from the
protein side and 3D ligand conformation clustering from the small
molecule side is recommended. The same considerations apply to the
oligomeric state of each entity and the stoichiometry of each ligand
in a PLC, information that is not available from the PDB pre-release.
These factors are particularly important when the same ligand is present
in different protein pockets or in cases where a ligand is involved in
protein oligomerization. Therefore, this information must be
incorporated for assessment and when creating a representative benchmark
dataset, and will be explored in future efforts.
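As a minimal illustration of clustering on the small-molecule side, ligand poses with a shared atom ordering can be grouped by pairwise coordinate RMSD using greedy leader clustering. The threshold and the assumption of a fixed atom correspondence are simplifications; a real pipeline would also handle molecular symmetry, e.g. with a cheminformatics toolkit:

```python
import numpy as np

def greedy_cluster(coords_list, threshold=2.0):
    """Greedy leader clustering of ligand poses by coordinate RMSD.

    coords_list: list of (N x 3) arrays with a shared atom ordering.
    Returns, for each pose, the index of its cluster representative.
    The 2 A threshold is an illustrative choice.
    """
    def rmsd(a, b):
        return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

    reps = []        # indices of cluster representatives seen so far
    assignment = []
    for i, c in enumerate(coords_list):
        for r in reps:
            if rmsd(c, coords_list[r]) < threshold:
                assignment.append(r)
                break
        else:                      # no existing cluster is close enough
            reps.append(i)
            assignment.append(i)
    return assignment
```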
2.3 Can we automatically score predicted protein-ligand
complexes?
We developed an automated benchmarking workflow consisting of two
components: (1) preprocessing, input preparation, set-up, and running
of five PLC prediction tools (AutoDock Vina30,31, SMINA32, GNINA33,
DiffDock34, and TankBind35) with
different input parameters, and (2) assessment of PLC
prediction results using different scoring metrics. The workflow is
implemented using Nextflow36 to
enable efficient parallelization and distributed execution, making it
well-suited for handling large datasets and computationally intensive
tasks. Each process is encapsulated in a module, with dependency
management controlled using Conda37 or Singularity38. The
resources for each step in the pipeline are defined individually,
ensuring that only the required resources are reserved and failed
processes are automatically restarted with increased resources. Upon
completion, all the predicted binding poses are collected and a summary
of scores is created, along with reporting on resource usage across the
evaluated tools.
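The restart-with-increased-resources behaviour can be sketched as follows; the function name and the memory-doubling policy are illustrative stand-ins for Nextflow's per-process retry directives, not the actual pipeline configuration:

```python
def run_with_retry(task, base_mem_gb=4, max_attempts=3):
    """Resubmit a failed step with increased resources (illustrative sketch).

    task: callable taking a memory budget in GB; raises MemoryError when
    the budget is insufficient. Doubling per attempt is an assumption.
    """
    mem = base_mem_gb
    for attempt in range(1, max_attempts + 1):
        try:
            return task(mem)
        except MemoryError:
            mem *= 2  # restart with increased resources
    raise RuntimeError(f"task failed after {max_attempts} attempts")
```

In Nextflow itself, the analogous behaviour is expressed declaratively with per-process resource directives rather than an explicit loop.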
We ran this workflow on the PDBbind time-split test set of 363
protein-ligand pockets. As the two most recent deep-learning tools in
our set, TankBind and DiffDock, were trained on the remaining proteins
in PDBbind, this is the fairest set to use for their evaluation at the
current time. However, it is important to emphasize that the aim of this
experiment is to demonstrate the feasibility of an automated
benchmarking workflow, and not a comprehensive evaluation of the tools,
due to the issues in this test set already discussed in the previous
sections.
As these tools take a protein structure as input, and we are
interested in extending this to settings where the structure itself may
be computationally modeled or in a different conformation, we also
evaluated PLC prediction results on 256 AlphaFold39 structures of
monomeric proteins from the same test set. 77% (197) of
the AlphaFold models are within 2 Å RMSD of the crystal structure.
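Checking whether a model falls within 2 Å RMSD of the crystal structure amounts to optimal superposition followed by an RMSD computation. A minimal sketch using the Kabsch algorithm, assuming matched coordinate sets (e.g. Cα atoms) as NumPy arrays:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between matched (N x 3) coordinate sets after optimal superposition."""
    # Center both coordinate sets on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Kabsch: optimal rotation from the SVD of the covariance matrix.
    H = P.T @ Q
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = P @ R.T - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```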
In order to demonstrate the workflow in different input settings, we use
P2Rank40 to detect pockets in each protein in the test set and report
results in two scenarios: Blind docking, which is considered the
worst-case scenario for docking tools, where no indication is provided
about the possible location of the ligand, and Best pocket docking,
representing the best-case scenario where the correct binding pocket is
known and used to define the docking search space. P2Rank was able to
predict the center of the correct binding pocket for 89.2% (324) of the
receptors within 8 Å distance of the true binding site center, defined
as the mean coordinate of the ligand in the pocket. On the other hand,
for the AlphaFold modeled receptors, the percentage was 81.1% (206),
where the ground truth pocket is defined by structural superposition of
the model with the reference structure. For the evaluation of Best
pocket docking, the P2Rank pocket that had the smallest distance from
the true binding site center was considered the best pocket.
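The best-pocket selection described above reduces to a nearest-center search. A minimal sketch, with the true site center defined as the mean ligand coordinate and the 8 Å cutoff as in the text (the function name is hypothetical):

```python
import numpy as np

def best_pocket(pocket_centers, ligand_coords, hit_cutoff=8.0):
    """Pick the predicted pocket center closest to the true binding site.

    The true site center is the mean ligand coordinate, as in the text.
    Returns (best pocket index, distance in A, within-cutoff flag).
    """
    site = np.asarray(ligand_coords).mean(axis=0)
    d = np.linalg.norm(np.asarray(pocket_centers) - site, axis=1)
    i = int(d.argmin())
    return i, float(d[i]), bool(d[i] <= hit_cutoff)
```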
The reporting workflow utilizes BiSyRMSD (referred to as RMSD) and
lDDT-PLI scoring to evaluate the predicted ligand structures generated
by the different docking methods. Both are novel scoring metrics
developed for the CASP15 CASP-PLI experiment9 that consider both the
predicted protein structure and the predicted ligand conformation;
lDDT-PLI additionally focuses on the interactions between protein and
ligand atoms. Table 1 and Table 2 display the outcomes for
PLC prediction using the 363 receptors from the PDBbind test set and the
256 AlphaFold modeled receptors respectively. The full results are
available as Supplementary Table 1 and 2 for the experimentally solved
and AlphaFold modeled receptors, respectively. The highest ranked pose
(top-1) and the best scored pose out of the top-5 ranked poses (where
the ranking is an output of each tool) are assessed for blind docking
where the entire protein is employed to define the search box.
Furthermore, for all tools except DiffDock, where this option is not
available, the same assessment is carried out for the best-case scenario
using the best pocket to define the search box. Figure 3 depicts the
distributions of these scores for the top-1 and best out of top-5 poses
for experimental and modeled receptors for both docking scenarios.
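Computing the top-1 and best-of-top-5 success rates from ranked pose RMSDs can be sketched as follows, using the 2 Å cutoff from the table captions (the function name is hypothetical):

```python
def success_rates(rmsds_per_target, cutoff=2.0):
    """Top-1 and best-of-top-5 success rates from ranked pose RMSDs.

    rmsds_per_target: one list of RMSDs per target, ordered by the
    tool's own ranking. SR = fraction of targets with RMSD < cutoff.
    """
    n = len(rmsds_per_target)
    top1 = sum(r[0] < cutoff for r in rmsds_per_target) / n
    top5 = sum(min(r[:5]) < cutoff for r in rmsds_per_target) / n
    return top1, top5
```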
Table 1: Prediction of small molecule binding to crystallized
protein structures from the PDBbind test set containing 363 PLC. For some
PLC the pipeline did not complete successfully. Shown are the number of
PLC (n), the success rate (SR) defined as the percentage of predictions
with RMSD < 2 Å, the median RMSD, the mean lDDT-PLI, and the
standard deviation of lDDT-PLI. DiffDock does not use a pocket
definition. TankBind gives only one prediction per search box.