Figure 1: Distributions across PLC pockets of A) experimental
resolution, B) difference between R and Rfree, C) ligand
RSCC, and D) the percentage of protein atoms within 6 Å of the ligand
that have an RSCC > 0.8. Pockets are divided into
categories depending on the number of rotatable bonds of the ligands
they contain. In each panel, the black line shows the suggested
threshold, and the percentage of pockets passing this criterion is
displayed.
In an automated benchmarking setting such as CAMEO, where quality
information is not available at the time the targets are selected,
filtering out even half of the data after predictions have been
generated would be unfortunate, indicating that even the relaxed
criteria are too stringent as a post-filter. An alternative would be to
take quality into account in the scoring process, and downweight low
quality regions of a structure in aggregate scores, without removing the
entire target. Ideally, an atom-level weighting would be used, especially
for larger ligands that can display variable levels of quality within
the residue itself. Unfortunately, the PDB does not make atom-level
quality information available in the validation reports at the time of
writing; the only atom-level information available is the occupancy
values, which are part of the structural data.
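The downweighting idea can be sketched as a weighted mean over atoms. This is a minimal illustration under stated assumptions, not the scoring used by CAMEO: it assumes a per-atom accuracy score and a per-atom quality weight in [0, 1] (for example the occupancy value, the only atom-level quantity mentioned above). The function name `quality_weighted_score` is hypothetical.

```python
from typing import Sequence


def quality_weighted_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-atom scores.

    `weights` are assumed per-atom quality values in [0, 1] (e.g. occupancy);
    atoms with weight 0 do not contribute to the aggregate.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all weights are zero; no atom carries usable signal")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight
```

With equal weights this reduces to the plain mean, so a target with uniformly good quality is scored exactly as it would be without weighting; only poorly supported atoms lose influence, and the target itself is never discarded.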
However, analyzing and incorporating validation data is a critical step
towards creating a representative dataset for other benchmarking
settings. For example, of the 255 small molecule pockets in the PDBBind
time-split test-set, 105 do not pass the relaxed criteria, which could
bias the results seen in recent benchmarking efforts using this test
set. Previous efforts have been made to create high-quality subsets of
PDBBind specifically for evaluation purposes [28].
However, these produced very small test sets, unlikely to be
representative of the entire protein-ligand space. The stringent Iridium
criteria, the suggested relaxed criteria, and the assessment of novelty
and diversity described in the next section form the basis for the
creation of a representative benchmark dataset. Indeed, similar efforts
to create benchmark sets for PLC are ongoing in the ELIXIR 3D-BioInfo
community [29]. The results of that initiative could be incorporated in
this assessment once they are available.
2.2 Is a protein-ligand complex target interesting to assess?
In the context of large-scale structural databases, such as the PDB, it
is possible to encounter several very similar PLC or complexes with the
same protein and ligand that have been crystallized in different
experimental conditions or resolved by means of different experimental
methods. When it comes to automated benchmarking of PLC prediction,
besides the quality of the structure, an important aspect to consider is
the novelty of the PLC to assess.
The CASP15 CASP-PLI assessment [9] highlighted the superiority of
template-based methods to model PLC
accurately. While most top predictions were produced by human groups
rather than automated methods, it is likely that automated methods will
in the future also leverage template information to predict PLC.
Therefore, when generating a benchmarking dataset for PLC prediction, we
need to ensure that PLC are not already represented in the PDB. For a
challenge such as CAMEO, the exact protein conformation and the pose of
the ligand within the protein complex are unknown. Thus, we use the
sequence as a proxy for protein novelty. As very similar ligands can
have striking differences in their poses, and we would like to retain as
many PLC as possible in the CAMEO pre-filtering setting, we use ligand
names as a proxy for the novelty of the ligand pose. To that end, we
investigated the novelty of the 236,538 small molecule pockets across
75,065 PLC and 32,273 unique small-molecule ligands described in Section
1.1.
We assessed the novelty of PLC released every year in the PDB by
verifying whether a particular combination of polymer entities and
ligands was present in previously released structures. For that purpose,
we performed sequence-based clustering of all polymer entities followed
by the assignment of an identifier to each PLC entry, consisting of the
sequence cluster identifiers of each entity and the chemical component
code of the ligands present in the PLC. Using different minimum sequence
identity thresholds helps reveal the level of novelty between the
entities of a PLC compared to previously seen PLC. Similarly, even for
PLC with identical proteins, the combination of ligands seen may differ.
The distribution of sequence clusters and ligand combinations seen per
year is shown in Figure 2, along with the fraction of PLC that pass the
relaxed quality criteria from Section 1. For example, the four different
bars for the 70-90% cluster in the year 2022 represent, in order, (1) all
PLC released in 2022 where every entity in the PLC has 70-90% identity to
every entity in a matching PLC from a previous year but the ligands are
not all the same, (2) same as (1) but only the PLC passing the relaxed
quality criteria from Section 1, (3) all PLC released in 2022 where every
entity has 70-90% identity to every entity in a matching PLC from a
previous year and the ligands are all the same, and (4) same as (3) but
only the PLC passing the relaxed quality criteria from Section 1.
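The identifier construction and the comparison against earlier years can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation. It assumes that sequence cluster identifiers have already been computed at one chosen identity threshold, and that the identifier is order-independent. The names `plc_identifier` and `compare_to_earlier_years` are hypothetical, as are the three labels.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (sorted cluster ids of all polymer entities, sorted unique ligand codes)
PlcId = Tuple[Tuple[str, ...], Tuple[str, ...]]


def plc_identifier(cluster_ids: Iterable[str], ligand_codes: Iterable[str]) -> PlcId:
    """Build an order-independent identifier for one PLC entry."""
    return tuple(sorted(cluster_ids)), tuple(sorted(set(ligand_codes)))


def compare_to_earlier_years(
    entries: Iterable[Tuple[int, Iterable[str], Iterable[str]]],
) -> List[Tuple[int, str]]:
    """Label each (year, cluster_ids, ligand_codes) entry against strictly earlier years.

    Results are returned grouped by ascending year, keeping input order within a year.
    """
    by_year: Dict[int, List[PlcId]] = {}
    for year, clusters, ligands in entries:
        by_year.setdefault(year, []).append(plc_identifier(clusters, ligands))

    seen_proteins: Set[Tuple[str, ...]] = set()
    seen_full: Set[PlcId] = set()
    labels: List[Tuple[int, str]] = []
    for year in sorted(by_year):
        for proteins, ligands in by_year[year]:
            if (proteins, ligands) in seen_full:
                labels.append((year, "same proteins, same ligands"))
            elif proteins in seen_proteins:
                labels.append((year, "same proteins, different ligands"))
            else:
                labels.append((year, "new proteins"))
        # Register this year's entries only after labelling, so that entries
        # released in the same year are not compared with each other.
        for proteins, ligands in by_year[year]:
            seen_proteins.add(proteins)
            seen_full.add((proteins, ligands))
    return labels
```

Running the same comparison with cluster identifiers computed at several identity thresholds would give one set of labels per threshold, which corresponds to the per-cluster bars described for Figure 2.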
We see that, from the protein perspective, 78.85% of PLC (and 71.83%
of valid PLC) released in 2022 have at least 30% sequence identity to a
matching PLC from previous years (across all entities). However, most of
these (79.14%) still have different combinations of ligands, indicating
that they may still be interesting to assess for PLC prediction. We
consider two different minimum sequence identity thresholds, 30% for
creating a diverse dataset and 90% for PLC prediction in CAMEO, and
define a PLC as novel if the minimum sequence identity between any of
its entities is less than the threshold in all matching PLC, or at least
one ligand in the PLC is not seen in matching PLC. With these
classification criteria, we found that, out of all the PLC released in
2022, 4515 (83.55%) PLC were novel and 889 were redundant at a
threshold of 30%, and 4833 (89.43%) PLC were novel and 571 were
redundant at a threshold of 90%. Hence, even at 30% sequence identity,
83.55% of all released structures contained some kind of novelty, with
at least one previously unseen protein(entity)-ligand combination. Among
the PLC that passed the validation criteria, 2202 (86.76%) PLC were
novel and 336 were redundant at a threshold of 30%, and 2360 (92.99%)
PLC were novel and 178 were redundant at a threshold of 90%.
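The novelty rule and the reported percentages can be illustrated with a short sketch. It encodes one possible reading of the definition above, namely that a PLC is redundant only if some earlier PLC reaches the identity threshold across all entities and also contains every ligand of the query. The representation of a match as a (minimum identity, ligand set) pair and the names `is_novel` and `percent_novel` are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, Tuple

# (minimum sequence identity across entities in %, ligand codes of the earlier PLC)
Match = Tuple[float, FrozenSet[str]]


def is_novel(ligands: Iterable[str], matches: Iterable[Match], threshold: float) -> bool:
    """Return True unless an earlier PLC is both similar enough and ligand-complete."""
    query = frozenset(ligands)
    for min_identity, match_ligands in matches:
        # Redundant: proteins at or above the threshold and no unseen ligand.
        if min_identity >= threshold and query <= match_ligands:
            return False
    return True


def percent_novel(novel: int, redundant: int) -> float:
    """Percentage of novel PLC among all classified PLC."""
    return 100.0 * novel / (novel + redundant)
```

Applied to the counts quoted above, `percent_novel(4515, 889)` and `percent_novel(4833, 571)` reproduce the 83.55% and 89.43% figures for all PLC released in 2022, and `percent_novel(2202, 336)` and `percent_novel(2360, 178)` reproduce the 86.76% and 92.99% figures for the PLC passing the validation criteria.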
Thus, most newly-released PLC are novel from either the protein or the
ligand perspective. However, redundant PLC are also released every year,
accounting for 10-20% of structures per year, and more than half of
these are highly redundant (90-100% sequence
identity and same ligands). The PDBBind time-split test-set also suffers
from a high degree of redundancy, with 62% of the test-set proteins
having >90% sequence identity with other test-set proteins
and 59% having >90% identity to proteins in the
training-set. This indicates that this set would not be able to
accurately represent protein-ligand space, even if all the ligands were
chemically dissimilar, which is not the case.