Assay Descriptors for Improved Bioactivity Prediction Performance
L. Schoenmaker, W. Jespers, J.B. Beltman and G.J.P. van Westen
Division of Drug Discovery and Safety, Leiden, The Netherlands
With growing insight into the mechanisms of toxicity and advances in computational modeling, in silico methods have the potential to become a key instrument in chemical safety assessment. To explore this approach, we are developing the Virtual Human Platform for Safety Assessment (VHP4Safety) [1]. Central to predicting an adverse outcome is linking internal exposure to toxic effects via molecular initiating event modelling [2]. This can be achieved by predicting protein-ligand interactions with proteochemometric (PCM) models [3].
Previous efforts have led to the establishment of large databases for bioactivity modeling [4]. However, a major issue with these aggregated data is that bioactivity on a protein can be measured in different assays whose results do not necessarily correlate well [5]. Normally, this diversity is either ignored or handled by giving preference to certain assay readouts. Recently, it has been shown that distinguishing between assays, by using assay identifiers as additional model input or as separate tasks, improves PCM model performance [6].
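As an illustration of the "assay identifier as additional model input" idea, the sketch below appends a one-hot encoded assay identifier to the compound and protein descriptors of a single data point. It is a minimal example under stated assumptions, not the implementation used in the cited work: the function names, the descriptor values, and the assay identifiers are all hypothetical.

```python
# Illustrative sketch only: descriptor values and assay IDs are made up.

def one_hot(value, vocabulary):
    """Return a one-hot list marking the position of `value` in `vocabulary`."""
    return [1.0 if item == value else 0.0 for item in vocabulary]


def pcm_input(compound_desc, protein_desc, assay_id, assay_vocab):
    """Concatenate compound, protein, and assay-identifier features."""
    return list(compound_desc) + list(protein_desc) + one_hot(assay_id, assay_vocab)


assay_vocab = ["ASSAY_A", "ASSAY_B", "ASSAY_C"]  # hypothetical identifiers
row = pcm_input([0.2, 0.7], [1.0, 0.0, 0.5], "ASSAY_B", assay_vocab)
print(row)
```

The resulting row can be passed to any regression or classification model. Treating each assay as a separate task would instead keep the input unchanged and give the model one output per assay.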
This work adds more detailed assay information to improve PCM models further. To this end, we combined multiple methods. We used a bag-of-words model to extract features from existing assay descriptions, as we have done previously using paper abstracts [7]. These features were then used to define assay similarity metrics and to cluster assays. In addition, we compared this approach to a neural topic modeling approach [8]. Ultimately, we constructed PCM models on this enriched dataset that performed better than the current state of the art.
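The pipeline of bag-of-words features, assay similarity, and clustering can be sketched as follows. This is a simplified illustration and not the authors' implementation: the abstract does not specify the tokenization, weighting, similarity metric, or clustering algorithm, so raw token counts, cosine similarity, a greedy threshold-based clustering, and the example assay descriptions are all assumptions made here.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on non-alphanumerics, and count tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(vectors, threshold):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices; the first is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical assay descriptions, for illustration only.
descriptions = [
    "Inhibition of human kinase in a radiometric binding assay",
    "Radiometric binding assay for inhibition of human kinase",
    "Cell viability measured in HepG2 cells after 24 h",
]
vectors = [bag_of_words(d) for d in descriptions]
clusters = cluster(vectors, threshold=0.5)
print(clusters)  # [[0, 1], [2]]
```

The cluster index of each assay, or the bag-of-words vector itself, could then serve as an assay descriptor in the PCM model input. In practice, stop-word removal and TF-IDF weighting would likely reduce the influence of uninformative tokens such as "in" or "of".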
The VHP4Safety project NWA 1292.19.272 is part of the NWA research program ‘Research along Routes by Consortia (ORC)’ and is funded by the Netherlands Organization for Scientific Research (NWO) and coordinated by Utrecht University, Utrecht University of Applied Sciences, and RIVM.
[1] Virtual Human Platform for Safety Assessment. https://vhp4safety.nl/
[2] Pittman ME, Edwards SW, Ives C, Mortensen HM (2018) AOP-DB: A database resource for the exploration of Adverse Outcome Pathways through integrated association networks. Toxicol Appl Pharmacol 343:71–83. https://doi.org/10.1016/j.taap.2018.02.006
[3] van Westen GJP, Wegner JK, IJzerman AP, van Vlijmen HWT, Bender A (2011) Proteochemometric modeling as a tool to design selective compounds and for extrapolating to novel targets. MedChemComm 2:16–30. https://doi.org/10.1039/C0MD00165A
[4] Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR (2017) The ChEMBL database in 2017. Nucleic Acids Res 45:D945–D954. https://doi.org/10.1093/nar/gkw1074
[5] Tang J, Szwajda A, Shakyawar S, Xu T, Hintsanen P, Wennerberg K, Aittokallio T (2014) Making Sense of Large-Scale Kinase Inhibitor Bioactivity Data Sets: A Comparative and Integrative Analysis. J Chem Inf Model 54:735–743. https://doi.org/10.1021/ci400709d
[6] Pentina A, Clevert D-A (2022) Multi-task Proteochemometric Modelling
[7] Papadatos G, van Westen GJP, Croset S, Santos R, Trubian S, Overington JP (2014) A document classifier for medicinal chemistry publications trained on the ChEMBL corpus. J Cheminform 6:40. https://doi.org/10.1186/s13321-014-0040-8
[8] Grootendorst M (2022) BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794