Helle W. van den Maagdenberg Abstract

QSPRpred: A Flexible and Open Quantitative Structure-Property Relationship Modelling Tool

Helle W. van den Maagdenberg¹, Linde Schoenmaker¹, Martin Sicho¹,², Olivier J. M. Béquignon¹, Sohvi Luukkonen¹, David Araripe¹,³, J.G. Coen van Hasselt¹, Piet H. van der Graaf¹,⁴ and Gerard J. P. van Westen¹

¹ Leiden Academic Centre of Drug Research, Leiden University, 55 Einsteinweg, 2333 CC Leiden, The Netherlands
² CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
³ Department of Human Genetics, Postzone S-04-P, Leiden University Medical Centre (LUMC), P.O. Box 9600, 2300 RC Leiden, The Netherlands
⁴ Certara, University Road, Canterbury Innovation Centre, Unit 43, CT2 7FG Canterbury, Kent, UK

Quantitative Structure-Property Relationship (QSPR) modelling has been embraced as a powerful tool by both industry and academia [1]. It is a computational modelling technique for predicting the relationship between structural characteristics of chemical entities and their properties. QSPR modelling plays a pivotal role in virtual screening and de novo drug design [2]. Currently, a number of tools are available to assist researchers with QSPR modelling (e.g. [3]). To construct novel architectures, many cheminformaticians prefer the flexibility of Python, supported by packages such as scikit-learn [4], RDKit [5] and PyTorch [6], over tools that offer a fixed set of prespecified models. However, experimenting with many different models and workflows quickly increases the complexity of the code. Therefore, we have developed QSPRpred to simplify the task of developing novel QSPR models while maintaining flexibility. Thanks to its modular structure, users can easily incorporate new models and features, while a base workflow keeps the code organized.

With QSPRpred one can build regression and single-class/multi-class classification models. Because QSPRpred is built mainly on RDKit and scikit-learn models, it can easily be incorporated into other Python cheminformatics projects, such as de novo generators [7]. Data and models are serialized in a transferable form so that the processing workflows and models can be shared between systems and users. Data pre-processing steps are provided, including filtering and transforming the input data, molecule cleaning, molecular descriptor calculation, feature filtering and data splitting. QSPRpred also contains common cheminformatics features specific to working with molecular data (e.g. a fast link to Papyrus [8], SMILES standardization and sanitization, and chemical space visualization integration). Model training supports cross-validation and hyper-parameter optimization, either through Bayesian optimization with Optuna [9] or through grid search. QSPRpred supports a selection of scikit-learn [4] models, and a fully-connected neural network has been pre-implemented in PyTorch [6]. Standard data preparation and model training steps can be run from the command line interface or customized further through the Python API, so that users can train a wide variety of QSPR models. Tutorials are provided to help users get started. Furthermore, QSPRpred will include functionality for more complex models, such as multi-task and proteochemometric models.
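As a rough illustration of the kind of workflow described above (feature filtering, data splitting, cross-validation and grid search), the following minimal sketch uses plain scikit-learn rather than the QSPRpred API, which is not shown here. The synthetic feature matrix standing in for molecular descriptors, the random forest model and the grid values are all assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline

# Assumption: synthetic data stands in for a descriptor matrix
# (rows = molecules, columns = molecular descriptors, y = class labels).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
# Append a constant column to mimic an uninformative descriptor.
X = np.hstack([X, np.zeros((X.shape[0], 1))])

# Data splitting: hold out 20% of the molecules as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Feature filtering (drop zero-variance descriptors) followed by the model.
pipeline = Pipeline(
    [
        ("filter", VarianceThreshold(threshold=0.0)),
        ("model", RandomForestClassifier(random_state=0)),
    ]
)

# Hypothetical hyper-parameter grid, kept small for illustration.
param_grid = {
    "model__n_estimators": [25, 50],
    "model__max_depth": [3, None],
}

# Grid search with 5-fold stratified cross-validation on the training set.
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Number of descriptors kept by the filter, and ROC AUC on the held-out set.
n_kept = int(search.best_estimator_.named_steps["filter"].get_support().sum())
test_auc = search.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print("Descriptors kept:", n_kept)
print("Held-out ROC AUC:", round(test_auc, 3))
```

Wrapping the filter and the model in a single pipeline means the filter is refit inside every cross-validation fold, so no information from the validation folds leaks into the feature selection.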

In conclusion, QSPRpred provides a standardized but adaptable pipeline for QSPR modelling. Here we will discuss how QSPRpred can be applied in a model development workflow and show an example use case: creating CYP substrate classification models. The code can be found through the Leiden Computational Drug Discovery GitHub page at https://github.com/CDDLeiden/QSPRpred.

References
[1] Artem Cherkasov, Eugene N. Muratov, et al. “QSAR Modeling: Where Have You Been? Where Are You Going To?” In: Journal of Medicinal Chemistry 57.12 (2014), pp. 4977–5010.
[2] Mingyang Wang, Zhe Wang, et al. “Deep learning approaches for de novo drug design: An overview”. In: Current Opinion in Structural Biology 72 (2022), pp. 135–144.
[3] Kevin Yang, Kyle Swanson, et al. “Analyzing Learned Molecular Representations for Property Prediction”. In: Journal of Chemical Information and Modeling 59.8 (2019), pp. 3370–3388.
[4] F. Pedregosa, G. Varoquaux, et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[5] RDKit. RDKit: Open-source cheminformatics.
[6] Adam Paszke, Sam Gross, et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. In: Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, pp. 8024–8035.
[7] M. Sicho, S. Luukkonen, et al. DrugEx: Deep Learning Models and Tools for Exploration of Drug-like Chemical Space. In preparation. 2023.
[8] O J M Béquignon, B J Bongers, et al. “Papyrus: a large-scale curated dataset aimed at bioactivity predictions”. In: Journal of Cheminformatics 15.1 (2023), p. 3.
[9] Takuya Akiba, Shotaro Sano, et al. “Optuna: A Next-generation Hyperparameter Optimization Framework”. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2019.