Evaluation of the Papyrus Dataset for Kinase Binding Affinity Prediction
Rachael Skyner, Maria Musgaard and Ben Tehan
One of the most challenging parts of building AI/ML models is selecting or constructing a suitable dataset. Publicly available datasets contain a trove of information on proteins, ligands, and their interactions, but the data vary in quality and are subject to experimental error. More focused datasets tend to be much smaller and may not suit models that require large amounts of data, such as neural networks. Furthermore, Klarner et al. have found that datasets commonly used to benchmark models against one another exhibit systematic experimental errors, which can lead to confounding statistical dependencies when multi-task models are used.
Béquignon et al. have recently released the Papyrus dataset, which aims to alleviate problems of data quality and range in bioactivity prediction. It combines multiple large and small publicly available datasets into around 60 million data points, which have been standardized and normalized for use in machine learning.
To build upon the excellent work of Béquignon et al., we present an open-source adaptation of the dataset as a PostgreSQL database for integration with other frameworks, together with a Python API for interrogating the database using the RDKit cartridge and razi.
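To illustrate the kind of query that the RDKit cartridge makes possible, the following is a minimal sketch and not the API presented in this work. It only builds a parameterized SQL statement and does not connect to a database. The table names (`molecules`, `activities`), the column names (`molecule_id`, `mol`, `target_id`, `pchembl_value`) and the function `build_substructure_query` are hypothetical; the `@>` substructure operator and the `qmol` type are provided by the RDKit cartridge, and the `%s` placeholders assume a psycopg-style driver.

```python
from typing import Tuple


def build_substructure_query(
    smarts: str, min_pchembl: float = 6.0, limit: int = 100
) -> Tuple[str, tuple]:
    """Build a parameterized substructure query for an RDKit-cartridge database.

    The schema used here is hypothetical and only serves as an illustration.
    """
    if not smarts:
        raise ValueError("smarts must be a non-empty string")
    if limit <= 0:
        raise ValueError("limit must be a positive integer")

    sql = (
        "SELECT m.molecule_id, a.target_id, a.pchembl_value "
        "FROM molecules AS m "  # hypothetical table holding a 'mol' column
        "JOIN activities AS a ON a.molecule_id = m.molecule_id "  # hypothetical
        "WHERE m.mol @> %s::qmol "  # RDKit cartridge: mol contains SMARTS query
        "AND a.pchembl_value >= %s "
        "ORDER BY a.pchembl_value DESC "
        "LIMIT %s"
    )
    # Values are passed separately so that the driver handles quoting.
    return sql, (smarts, min_pchembl, limit)


# Example: molecules containing an indole scaffold with reported activity >= 7.
sql, params = build_substructure_query("c1ccc2[nH]ccc2c1", min_pchembl=7.0, limit=10)
print(sql)
print(params)
```

With a live connection, the returned pair could be passed to a cursor as `cursor.execute(sql, params)`. Keeping the SMARTS pattern and thresholds as parameters, rather than formatting them into the string, avoids quoting problems with characters that are common in SMARTS.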
In this work, we investigate the utility of this new dataset, in its PostgreSQL form with the associated API, for generating kinase binding affinity prediction models. Kinases play a central role in virtually all signal transduction networks and so are common targets in drug discovery. Molecules designed as kinase inhibitors often exhibit off-target binding to seemingly unrelated kinases, so understanding the selectivity of any prospective selective inhibitor across the kinome is crucial.
Béquignon, O.J.M., Bongers, B.J., Jespers, W. et al. Papyrus: a large-scale curated dataset aimed at bioactivity predictions. J Cheminform 15, 3 (2023). https://doi.org/10.1186/s13321-022-00672-x