Benchmarking Molecular Fingerprints
Alexander G. Hasson⋆,1,2, Helen M. Byrne2, Eric O’Neill1 and Garrett M. Morris3
1 Mathematical Institute, University of Oxford
2 Department of Oncology, University of Oxford
3 Department of Statistics, University of Oxford
We present an open-source, scalable framework for the generation and comparative analysis of molecular fingerprints, implemented in Rust, a performant programming language. Molecular fingerprints are a fundamental cheminformatics tool. They are used in computer-aided drug discovery (CADD), for example in virtual screening, and also provide a means to interrogate chemical space. The functions that generate fingerprints are diverse: fingerprinting methods fall into several ‘families’, and their implementations are scattered across many programming languages and libraries. The current literature suggests that the choice of fingerprint is use-case specific. Consensus exists for some use cases, such as using ‘atom-pair’ fingerprints to compare large molecules (e.g. peptides) and ‘substructure’ fingerprints to compare small molecules.
Benchmarks exist for some cheminformatics applications [1, 2], such as virtual screening and predicting drug-drug interactions. These benchmarks suggest which type of fingerprint to use for a given application, but can be limited in the number of fingerprint methods tested.
There is also the question of how well a given fingerprint reflects the nature of compounds or molecules in the context of chemical space. Furthermore, there are no agreed-upon benchmarks for evaluating the computational cost of fingerprint methods.
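As an illustration of what a computational-cost measurement might look like, the following is a minimal sketch, not the framework's own benchmark. The names `time_fingerprinting` and `throughput` are hypothetical, and the fingerprint generator is passed in as an arbitrary closure, so nothing here assumes a particular fingerprint method or library API.

```rust
use std::time::{Duration, Instant};

/// Apply a fingerprint generator to every molecule and measure the
/// wall-clock time of the whole batch.
/// `M` is any molecule representation and `F` any fingerprint type;
/// both are placeholders chosen for illustration.
pub fn time_fingerprinting<M, F, G>(molecules: &[M], generate: G) -> (Vec<F>, Duration)
where
    G: Fn(&M) -> F,
{
    let start = Instant::now();
    let fingerprints: Vec<F> = molecules.iter().map(|m| generate(m)).collect();
    (fingerprints, start.elapsed())
}

/// Convert a molecule count and an elapsed time into molecules per second.
pub fn throughput(n_molecules: usize, elapsed: Duration) -> f64 {
    n_molecules as f64 / elapsed.as_secs_f64()
}
```

A single wall-clock measurement like this is noisy. A fair comparison between methods would repeat the run several times, fix the input set, and report the spread as well as the mean.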
This framework brings together fingerprints previously implemented in entirely different packages and languages, such as ‘RDKit’ (Python, C++) and ‘CDK’ (Java). These implementations are collected in a common performant language (Rust), which gives users greater customisation, such as generating ‘Circular’ fingerprints with path diameters greater than 6 (the maximum in CDK).
We use this framework to evaluate existing fingerprint methods in cheminformatics applications, such as virtual screening and similarity ensemble approaches (SEA), as well as on their computational complexity.
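To make the virtual-screening setting concrete, the following is a minimal sketch of similarity-based ranking. It assumes folded binary fingerprints stored as 64-bit words and uses the Tanimoto coefficient, a common choice for such comparisons. The abstract does not state which similarity measure or data layout the framework uses, so the `Fingerprint` type and the functions `tanimoto` and `screen` are illustrative assumptions only.

```rust
/// A folded binary fingerprint stored as 64-bit words
/// (hypothetical layout, chosen for illustration).
pub type Fingerprint = Vec<u64>;

/// Tanimoto (Jaccard) similarity of two bit fingerprints:
/// |A AND B| / |A OR B|, counted over set bits.
/// By convention here, two all-zero fingerprints score 0.0.
pub fn tanimoto(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "fingerprints must have equal length");
    let mut both = 0u32;
    let mut either = 0u32;
    for (x, y) in a.iter().zip(b.iter()) {
        both += (x & y).count_ones();
        either += (x | y).count_ones();
    }
    if either == 0 {
        0.0
    } else {
        both as f64 / either as f64
    }
}

/// Score every library fingerprint against the query and return the
/// `k` best hits as (library index, similarity), most similar first.
pub fn screen(query: &[u64], library: &[Fingerprint], k: usize) -> Vec<(usize, f64)> {
    let mut scored: Vec<(usize, f64)> = library
        .iter()
        .enumerate()
        .map(|(i, fp)| (i, tanimoto(query, fp)))
        .collect();
    // Sort by decreasing similarity; total_cmp gives a total order on f64.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

This exhaustive scan is linear in the library size. At the scale of billions of molecules it would need to be replaced or supplemented by an indexed or approximate search.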
Lastly, we present the first such benchmark to be applied to ultra-scale molecular databases, such as ZINC22 (37 billion molecules) and REAL23 (69 billion molecules). To facilitate this, we use open-source, scalable, and fingerprint-agnostic algorithms as alternatives to commercial products such as ‘Arthor’.
References
[1] Sereina Riniker and Gregory A. Landrum. Open-source platform to benchmark fingerprints for ligand-based virtual screening. Journal of Cheminformatics, 5(1), 2013. https://doi.org/10.1186/1758-2946-5-26.
[2] Noel M. O’Boyle and Roger A. Sayle. Comparing structural fingerprints using a literature-based similarity benchmark. Journal of Cheminformatics, 8(1), 2016. https://doi.org/10.1186/s13321-016-0148-0.
[3] Alice Capecchi, Daniel Probst, and Jean-Louis Reymond. One molecular fingerprint to rule them all: Drugs, biomolecules, and the metabolome. Journal of Cheminformatics, 12(1), 2020. https://doi.org/10.1186/s13321-020-00445-4.
[4] B. Zagidullin, Z. Wang, Y. Guan, E. Pitkänen, and J. Tang. Comparative analysis of molecular fingerprints in prediction of drug combination effects. Briefings in Bioinformatics, 22(6), 2021. https://doi.org/10.1093/bib/bbab291.
[5] Greg Landrum, Paolo Tosco, Brian Kelley, Ric, Sriniker, Gedeck, Riccardo Vianello, David Cosgrove, Schneider Nadine, Eisuke Kawashima, et al. rdkit/rdkit: 2022_09_4 (Q3 2022). 2022. https://doi.org/10.5281/zenodo.7541264.
[6] The CDK Project. Chemistry Development Kit. 2019. URL https://cdk.github.io/.
[7] Benjamin I. Tingle, Khanh G. Tang, Mar Castanon, John J. Gutierrez, Munkhzul Khurelbaatar, Chinzorig Dandarchuluun, Yurii S. Moroz, and John J. Irwin. ZINC-22: a free multi-billion-scale database of tangible compounds for ligand discovery. Journal of Chemical Information and Modeling, 2023. https://doi.org/10.1021/acs.jcim.2c01253.
[8] Oleksandr O. Grygorenko, Dmytro S. Radchenko, Igor Dziuba, Alexander Chuprina, Kateryna E. Gubina, and Yurii S. Moroz. Generating multibillion chemical space of readily accessible screening compounds. iScience, 23(11):101681, 2020. https://doi.org/10.1016/j.isci.2020.101681.