Development of QSAR Models for SBDD of GPCRs
Jon D. Tyzack, Daniel Santos Stone, Noel M. O’Boyle and Chris de Graaf
Sosei Heptares, Cambridge, UK
The preparation and application of QSAR models in a pharmaceutical context is driven by their utility in pushing forward drug-discovery projects, where the emphasis is on producing good local models for a particular chemical series rather than universally applicable global models. A key consideration in live drug-discovery projects is that data is continually being generated, so it is important to adapt the models to the chemical space being explored. Here we present examples of our approach to developing QSAR models for the chemical space of interest to our internal drug-discovery projects.
The standard practice for assessing a particular machine-learning (ML) model is to build it on a static dataset and measure its performance via cross-validation or on an unseen test set. This is in stark contrast with the prospective application of the model, where project data is continually generated and the model regularly updated; a trivial observation, for example, is that prospective predictions for molecules synthesised at the start of a project will tend to be poorer than for those synthesised later. We propose an alternative way to assess QSAR model performance that reflects this prospective application.
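The prospective assessment described above can be sketched as a walk-forward evaluation: compounds are kept in synthesis order, and at each step the model is refit on everything made so far and scored on the next batch. The sketch below is illustrative only; a simple ridge regression stands in for the actual project QSAR model, and the function name and parameters (`prospective_rmse`, `n_initial`, `batch`) are assumptions, not the authors' implementation.

```python
import numpy as np

def prospective_rmse(X, y, n_initial=20, batch=10):
    """Walk-forward evaluation: rows of X/y must be in synthesis order.
    At each step the model is refit on all compounds made so far and
    scored on the next batch, mimicking live-project use."""
    errs = []
    for start in range(n_initial, len(y), batch):
        X_tr, y_tr = X[:start], y[:start]
        X_te, y_te = X[start:start + batch], y[start:start + batch]
        # ridge regression as a stand-in for the project QSAR model
        A = X_tr.T @ X_tr + 1e-3 * np.eye(X_tr.shape[1])
        w = np.linalg.solve(A, X_tr.T @ y_tr)
        errs.append(np.sqrt(np.mean((X_te @ w - y_te) ** 2)))
    return np.array(errs)  # one RMSE per prospective time step

# synthetic demo: noisy linear "activity" data in synthesis order
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 2.0]) + 0.1 * rng.normal(size=200)
per_step = prospective_rmse(X, y)
```

Unlike a single cross-validated score, this yields a trajectory of errors over the project timeline, which is the quantity of interest when deciding how often to retrain.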
Commercial QSAR models are designed for good global performance over a broad chemical space but may not perform well on a particular local chemical series of interest. To address this, some software allows the user to tune the model with additional in-house data. Here we show that oversampling provides a degree of control over the extent to which in-house data modulates a pre-trained model. In combination with the prospective assessment method described above, this allows the user to find the right balance between accuracy and overfitting.
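The oversampling idea can be illustrated as follows: the in-house series is replicated k times and appended to the global training set before refitting, so k dials how strongly local data pulls the model away from its global behaviour (k = 0 recovers the purely global fit). This is a minimal sketch under assumed names (`tune_with_oversampling`, `k`) with ridge regression standing in for the commercial model, not the vendor's actual tuning mechanism.

```python
import numpy as np

def tune_with_oversampling(X_global, y_global, X_local, y_local, k=5, lam=1e-3):
    """Refit on the global set plus the in-house set replicated k times.
    Larger k gives the local series more influence; k=0 is the global fit."""
    X = np.vstack([X_global] + [X_local] * k) if k > 0 else X_global
    y = np.concatenate([y_global] + [y_local] * k) if k > 0 else y_global
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

# synthetic demo: the local series follows a shifted structure-activity trend
rng = np.random.default_rng(1)
w_global, w_local = np.array([1.0, 0.0, 0.5]), np.array([1.0, 2.0, 0.5])
X_g = rng.normal(size=(500, 3)); y_g = X_g @ w_global + 0.05 * rng.normal(size=500)
X_l = rng.normal(size=(30, 3));  y_l = X_l @ w_local + 0.05 * rng.normal(size=30)
X_t = rng.normal(size=(50, 3));  y_t = X_t @ w_local + 0.05 * rng.normal(size=50)

rmse = lambda w: np.sqrt(np.mean((X_t @ w - y_t) ** 2))
err_global = rmse(tune_with_oversampling(X_g, y_g, X_l, y_l, k=0))
err_tuned = rmse(tune_with_oversampling(X_g, y_g, X_l, y_l, k=10))
```

Sweeping k and scoring each setting with the prospective walk-forward evaluation is one way to pick the accuracy/overfitting trade-off empirically.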