Kunal Roy Abstract

A new workflow for QSAR model development from small data sets: Integration of data curation, double cross-validation and consensus prediction tools

Pravin Ambure1, Agnieszka Gajewicz2, M. Natalia D. S. Cordeiro1, Kunal Roy3

1LAQV@REQUIMTE/Department of Chemistry and Biochemistry, University of Porto, 4169-007 Porto, Portugal
2Laboratory of Environmental Chemometrics, Faculty of Chemistry, University of Gdansk, Gdansk, Poland
3Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700032, India
Quantitative structure-activity/property relationship (QSAR/QSPR) models have long been applied in drug design and predictive toxicology, in addition to their more recent applications in materials science, food science, nanoscience, etc.1 These models serve two main purposes: prediction of endpoint values for untested chemicals to fill data gaps, and physico-chemical and mechanistic interpretation of structure-response relationships. We frequently come across small data sets (25-50 data points) for specialized endpoints such as the toxicity of nanomaterials, the properties of catalysts, etc. Because experimental data for such endpoints are scarce, it is desirable to develop QSAR/QSPR models to fill the data gaps. However, it is difficult to develop a properly validated and robust QSAR/QSPR model from a small data set for several reasons: setting aside part of the already limited data as a test set leaves even less for model development; descriptor selection is biased by the fixed composition of the small training set; and outliers in both the chemical and biological domains unduly influence the modeling of the data set.

To address these problems, we suggest here a workflow that models the whole small data set (i.e., without data set division) by integrating three major steps: data curation, double cross-validation and consensus prediction. In the data curation step, we identify structural and response-range outliers (compounds sufficiently different from the rest with respect to chemical features and/or response values)2, as well as activity cliffs (compounds that are similar in chemical features but differ markedly in response values). For double cross-validation, we carry out leave-many-out cross-validation in different iterations3 and select the best model based on the lowest error of the respective validation sets4.
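The curation step can be sketched in a few lines of NumPy. This is only an illustrative implementation, not the exact criteria used in the tool: it flags structural outliers by standardized descriptor values, response-range outliers by standardized responses, and activity cliffs by small pairwise descriptor distance combined with a large response difference; all cut-off values (`z_cut`, `sim_cut`, `act_cut`) are hypothetical defaults.

```python
import numpy as np

def find_outliers_and_cliffs(X, y, z_cut=3.0, sim_cut=0.15, act_cut=1.0):
    """Flag structural/response-range outliers and activity cliffs.

    X : (n, p) descriptor matrix; y : (n,) response vector.
    All cut-offs are illustrative, not prescribed values.
    """
    # Standardize descriptors; a compound whose largest absolute
    # standardized value exceeds z_cut is a structural outlier.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    structural = np.abs(Z).max(axis=1) > z_cut

    # Response-range outliers: responses far from the bulk of y.
    zy = np.abs(y - y.mean()) / y.std(ddof=1)
    response = zy > z_cut

    # Activity cliffs: pairs close in (standardized) descriptor space
    # but far apart in response value.
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2) / np.sqrt(Z.shape[1])
    cliffs = [(i, j)
              for i in range(len(y)) for j in range(i + 1, len(y))
              if d[i, j] < sim_cut and abs(y[i] - y[j]) > act_cut]
    return structural, response, cliffs
```

Compounds flagged here would be inspected (and possibly removed) before model development, since a single cliff or outlier can dominate a 25-50-point data set.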
The training set is repeatedly split k times into calibration and validation sets: the calibration objects are used to develop different models, whereas the validation objects are used to estimate the models' errors. Finally, the best selected models are applied for consensus predictions of the query compounds, using both a simple average and a weighted average (weights based on the mean absolute error computed from leave-one-out prediction errors of all training compounds) of the individual predictions5. The strength of a consensus approach is that the final result takes into account the different assumptions characterizing each model, encompassing different chemical features and their contributions, and thereby allows a more reliable judgment in a complex situation; the drawback of one model can also be offset by another model used in the consensus. Finally, to demonstrate the applicability of the workflow, case studies are performed on a few small data sets. The suggested workflow has been made freely available as a software tool at https://dtclab.webs.com/software-tools .
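The double cross-validation and consensus steps described above can be sketched as follows, again only as a minimal illustration of the logic rather than the tool's exact procedure: candidate MLR models (here, hypothetical pre-chosen descriptor subsets standing in for the tool's descriptor selection) are ranked by their mean validation error over repeated random calibration/validation splits, and the best few are combined by a simple average and by a weighted average whose weights are the reciprocals of the leave-one-out mean absolute errors.

```python
import numpy as np

def _fit_predict(Xc, yc, Xp):
    """Ordinary least-squares MLR with intercept: fit on (Xc, yc), predict Xp."""
    A = np.column_stack([np.ones(len(Xc)), Xc])
    beta, *_ = np.linalg.lstsq(A, yc, rcond=None)
    return np.column_stack([np.ones(len(Xp)), Xp]) @ beta

def dcv_consensus(X, y, X_query, descriptor_subsets,
                  n_splits=20, val_frac=0.3, n_models=3, seed=0):
    """Rank candidate models over repeated calibration/validation splits,
    then return simple- and MAE-weighted consensus predictions."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_val = max(1, int(round(val_frac * n)))
    splits = [rng.permutation(n) for _ in range(n_splits)]

    # Mean validation RMSE of each candidate descriptor subset.
    errors = []
    for cols in descriptor_subsets:
        cols, rmse = list(cols), []
        for perm in splits:
            val, cal = perm[:n_val], perm[n_val:]
            pred = _fit_predict(X[np.ix_(cal, cols)], y[cal], X[np.ix_(val, cols)])
            rmse.append(np.sqrt(np.mean((y[val] - pred) ** 2)))
        errors.append(np.mean(rmse))
    best = np.argsort(errors)[:n_models]

    # Refit the best models on all data; weight each by 1 / (LOO MAE).
    preds, weights = [], []
    for k in best:
        cols = list(descriptor_subsets[k])
        preds.append(_fit_predict(X[:, cols], y, X_query[:, cols]))
        loo = np.array([_fit_predict(np.delete(X[:, cols], i, axis=0),
                                     np.delete(y, i),
                                     X[i:i + 1, cols])[0]
                        for i in range(n)])
        weights.append(1.0 / np.mean(np.abs(y - loo)))
    preds = np.array(preds)
    return preds.mean(axis=0), np.average(preds, axis=0, weights=weights)
```

The MAE-based weighting downgrades, without discarding, any selected model that predicts its own training compounds poorly under leave-one-out, which is how one model's weakness can be compensated by the others in the consensus.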

References
1. Roy K (editor), Advances in QSAR Modeling. Applications in Pharmaceutical, Chemical, Food, Agricultural and Environmental Sciences. Springer, 2017, http://www.springer.com/in/book/9783319568492
2. Roy K, Kar S, Ambure P, On a simple approach for determining applicability domain of QSAR models. Chemom Intell Lab Sys, 145, 2015, 22-29, http://dx.doi.org/10.1016/j.chemolab.2015.04.013
3. Roy K, Ambure P, The “double cross-validation” tool for MLR QSAR model development. Chemom Intell Lab Sys, 159, 2016, 108-126, http://dx.doi.org/10.1016/j.chemolab.2016.10.009
4. Roy K, Das RN, Ambure P, Aher RB, Be aware of error measures. Further studies on validation of predictive QSAR models. Chemom Intell Lab Sys, 152, 2016, 18-33, http://dx.doi.org/10.1016/j.chemolab.2016.01.008
5. Roy K, Ambure P, Kar S, Ojha PK, Is it possible to improve the quality of predictions from an “intelligent” use of multiple QSAR/QSPR/QSTR models? J Chemom 32, 2018, e2992, http://dx.doi.org/10.1002/cem.2992