Coverage Score: A Model Agnostic Method to Efficiently Explore Chemical Space
Daniel J Woodward, Anthony Bradley and Willem van Hoorn
Exscientia, The Schrödinger Building, Oxford Science Park, Oxford OX4 4GE, UK
Selecting the most appropriate compounds to synthesise and test is a vital aspect of drug discovery. Unsupervised methods such as clustering and dissimilarity maximisation have weaknesses when the goal is to select the set that yields the most information gain. Active learning (AL) techniques can produce performant models with limited training data; however, some AL techniques rely on an initial model and/or computationally expensive semi-supervised batch selection.
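To make the dissimilarity-maximisation baseline concrete, the following is a minimal sketch of a greedy MaxMin picker over Tanimoto distances. It is an illustration only and not the implementation used in this work: the fingerprint representation (sets of on-bit indices), the function names `tanimoto_distance` and `maxmin_select`, the fixed seed compound and the toy fingerprints are all assumptions made for the example.

```python
from typing import List, Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - Tanimoto similarity for fingerprints stored as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def maxmin_select(fps: List[Set[int]], k: int, seed_index: int = 0) -> List[int]:
    """Greedily pick k compounds, each maximising its distance to the
    nearest compound already picked (hypothetical helper, MaxMin heuristic)."""
    if not 0 < k <= len(fps):
        raise ValueError("k must be between 1 and the number of fingerprints")
    selected = [seed_index]
    # min_dist[i] is the distance from compound i to its closest selected compound.
    min_dist = [tanimoto_distance(fp, fps[seed_index]) for fp in fps]
    while len(selected) < k:
        remaining = (i for i in range(len(fps)) if i not in selected)
        candidate = max(remaining, key=lambda i: min_dist[i])
        selected.append(candidate)
        # Update nearest-selected distances with the newly picked compound.
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[candidate]))
    return selected


# Toy fingerprints (made up for illustration): two structural families.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8, 10}]
picked = maxmin_select(toy_fps, k=3)
print(picked)
```

A picker of this kind favours outliers and ignores how well the chosen compounds represent densely populated regions, which is one of the weaknesses of purely dissimilarity-driven selection noted above.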
Here, we present a new subset-based selection method, Coverage Score [1], that combines Bayesian statistics and information entropy to balance representation and dissimilarity when selecting a maximally informative subset. Coverage Score can be influenced by prior selections and desirable properties. We compare subsets selected through Coverage Score against subsets selected through model-independent and model-dependent techniques on several datasets. In drug-like chemical space, Coverage Score consistently selects subsets that lead to more accurate predictions than other selection methods. Random Forest models trained on subsets selected through Coverage Score have a root-mean-square error (RMSE) up to 12.8% lower than models trained on randomly selected subsets, and these subsets can retain up to 99% of the structural dissimilarity of a diversity maximisation selection.
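The comparison against random selection is reported as a relative reduction in root-mean-square error. A small sketch of that arithmetic is given below; the helper names `rmse` and `relative_rmse_reduction` are hypothetical, and the numeric values are placeholders chosen for illustration, not results from this study.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def relative_rmse_reduction(rmse_baseline: float, rmse_method: float) -> float:
    """Percentage by which rmse_method is lower than rmse_baseline."""
    return 100.0 * (rmse_baseline - rmse_method) / rmse_baseline


# Placeholder values only: a baseline model (random subset) and a model
# trained on a subset chosen by some other selection method.
baseline_error = 1.000
method_error = 0.872
reduction = relative_rmse_reduction(baseline_error, method_error)
print(f"RMSE reduction: {reduction:.1f}%")
```

A positive value means the model trained on the selected subset predicts the held-out compounds more accurately than the model trained on the random subset.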
In summary, we have developed a model-independent method for compound selection that can efficiently augment a training set, which in turn enables faster model learning.