4 Conclusion

With the combined prediction of protein-ligand complexes forming the next frontier for deep learning in computational structural biology, independent, comprehensive, and blind assessment of prediction methods is needed to reveal the advantages and shortcomings of both classical and novel approaches. Two complementary strategies can serve this purpose: weekly continuous evaluation of structures newly released in the PDB, and the creation of a representative, diverse dataset for benchmarking.
In this study, we examined three challenges essential for establishing such systems in an automated and unsupervised manner: determining whether an experimentally solved PLC can serve as ground truth, assessing the interest or difficulty of a PLC as a prediction target, and automating the scoring of predicted PLCs. In the process, we defined quality criteria for PLC pockets, assessed novelty in the PDB over the years, and developed an automated workflow for PLC prediction and assessment using newly developed scoring metrics. Ligand preparation is a known challenge in docking, and throughout our work we faced obstacles in automating it, in particular in molecule parsing and protonation.
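As a minimal sketch of where such automation typically breaks (not our actual pipeline), the following RDKit-based loader illustrates the two failure points named above: strict parsing of deposited ligand files can fail outright, and hydrogen addition alone does not yield a pH-dependent protonation state. The file format and fallback strategy are illustrative assumptions.

```python
# Illustrative sketch of a fault-tolerant ligand loader; not the
# workflow used in this study.
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def load_ligand(sdf_path: str):
    """Try strict parsing first; fall back to unsanitized parsing."""
    mol = Chem.MolFromMolFile(sdf_path, sanitize=True)
    if mol is None:
        # Strict parsing failed (e.g. valence errors in the deposited
        # file); retry without sanitization and repair what we can.
        mol = Chem.MolFromMolFile(sdf_path, sanitize=False)
        if mol is None:
            raise ValueError(f"Unparseable ligand: {sdf_path}")
        # Sanitize everything except kekulization, a frequent failure
        # mode for aromatic systems in deposited ligand files.
        Chem.SanitizeMol(
            mol,
            sanitizeOps=Chem.SANITIZE_ALL ^ Chem.SANITIZE_KEKULIZE,
        )
    # Protonation is the other failure point: AddHs only adds hydrogens
    # consistent with the recorded formal charges; it does not assign a
    # pH-dependent protonation state (a dedicated tool such as
    # Dimorphite-DL would be needed for that).
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    return Chem.AddHs(mol, addCoords=True)
```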
The PDBBind dataset has frequently been used to train deep-learning-based docking methods and to evaluate their accuracy. Many deep learning methods retain 363 PDBBind PLCs, selected by a release date after 2019, as their test set. However, this selection is not ideal for benchmarking: only half of the structures meet the quality criteria, so the ground truth is unreliable for the remainder; no redundancy removal was performed; and diversity was not considered when choosing the PLCs. Consequently, there is a need for a representative dataset that follows the concepts presented in this study.
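The curation logic this implies can be sketched as follows. The helpers `passes_pocket_quality`, `protein_cluster`, and `ligand_scaffold` are hypothetical stand-ins for the study's pocket quality criteria and for standard redundancy measures (e.g., protein sequence clustering and ligand scaffold extraction); the cutoff date mirrors the time split described above.

```python
# Hedged sketch of a test-set curation pipeline; helpers are
# hypothetical placeholders, not the study's actual implementation.
from dataclasses import dataclass
from datetime import date

@dataclass
class Entry:
    pdb_id: str
    release_date: date
    pocket_ok: bool       # would come from the pocket quality checks
    cluster_id: str       # would come from protein sequence clustering
    scaffold: str         # would come from ligand scaffold extraction

def passes_pocket_quality(entry: Entry) -> bool:
    return entry.pocket_ok

def protein_cluster(entry: Entry) -> str:
    return entry.cluster_id

def ligand_scaffold(entry: Entry) -> str:
    return entry.scaffold

def curate_test_set(entries):
    # 1. Time split: keep structures released after the training cutoff.
    recent = [e for e in entries if e.release_date > date(2019, 12, 31)]
    # 2. Quality: keep only PLCs whose pocket passes the quality
    #    criteria, so the ground truth is reliable.
    reliable = [e for e in recent if passes_pocket_quality(e)]
    # 3. Redundancy and diversity: keep one representative per
    #    (protein cluster, ligand scaffold) pair.
    seen, representatives = set(), []
    for e in reliable:
        key = (protein_cluster(e), ligand_scaffold(e))
        if key not in seen:
            seen.add(key)
            representatives.append(e)
    return representatives
```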