Imputation of Assay Bioactivity Data using Deep Learning

TM Whitehead1, BWJ Irwin2, P Hunt2, MD Segall2, GJ Conduit1,3

1 Intellegens; 2 Optibrium; 3 Cavendish Laboratory, University of Cambridge

Knowledge of compound bioactivity against drug targets underpins the discovery of new drugs. However, databases of compound bioactivities are currently sparse; for example, the ChEMBL dataset is just 0.05% complete, and the sparsity of data in proprietary pharma databases is similar. We will describe a novel deep learning algorithm that captures correlations within protein activity data, as well as between molecular descriptors and protein activities, to impute the missing activities. Unlike many deep learning methods, this approach can take as input the sparse and variable data typical of those available in drug discovery. We will present examples illustrating the application of these deep learning networks to impute missing activities in the sparse input data, as well as to make predictions for new virtual compounds based on molecular descriptors alone.
The ability to impute unmeasured activities offers access to a vast trove of information. New hits can be identified for projects pursuing existing biological targets of interest, as can high-quality compounds that were overlooked during optimisation projects. Furthermore, compounds with results from early assays can be selected for progression with greater confidence, as downstream results can be accurately predicted. Understanding and exploiting the uncertainties and noise, both in experimental data and in predictions, further increases the confidence that can be assigned to conclusions drawn from the data. This helps to avoid decisions based on faulty data, and highlights where the deep learning models are most successful. We will present results from proof-of-concept studies on public domain datasets and on real, internal pharmaceutical data. The results from our method will be compared with those from conventional machine learning methods such as random forests, leading computational chemistry approaches such as the profile-QSAR method, and modern machine learning techniques including multi-target deep neural networks and matrix factorisation approaches. This analysis highlights the key advantages that stem from learning inter-assay correlations as well as descriptor-assay correlations, with our new approach able to outperform the leading existing techniques.
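
To make the idea of combining descriptor-assay and inter-assay correlations concrete, the following is a minimal illustrative sketch and not the algorithm described in this abstract. It swaps the deep learning network for a simple ridge regression and assumes a fully observed descriptor matrix, at least one measurement per assay, and synthetic data; the function name `iterative_impute` and all parameters are hypothetical.

```python
import numpy as np

def iterative_impute(X, Y, n_iter=20, ridge=1e-2):
    """Fill missing activities (np.nan) in Y using descriptors X and the other assays.

    X: (n_compounds, n_descriptors), assumed fully observed.
    Y: (n_compounds, n_assays), np.nan marks unmeasured activities.
    Linear ridge stand-in for illustration only; not the authors' network.
    """
    observed = ~np.isnan(Y)
    # Initial guess: mean of the measured values in each assay column.
    filled = np.where(observed, Y, np.nanmean(Y, axis=0))
    n, m = Y.shape
    for _ in range(n_iter):
        for j in range(m):
            rows = observed[:, j]
            # Features: descriptors, current estimates of the other assays, intercept.
            others = np.delete(filled, j, axis=1)
            F = np.hstack([X, others, np.ones((n, 1))])
            A = F[rows]
            # Ridge regression fitted only on compounds measured in assay j.
            w = np.linalg.solve(A.T @ A + ridge * np.eye(F.shape[1]),
                                A.T @ filled[rows, j])
            # Update only the unmeasured entries; measured values stay fixed.
            filled[~rows, j] = (F @ w)[~rows]
    return filled

# Synthetic example (assumed data): 3 correlated assays, 70% of entries hidden.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
latent = X @ rng.normal(size=(5, 1))
Y_true = latent @ np.array([[1.0, 2.0, -1.5]]) + 0.1 * rng.normal(size=(200, 3))
mask = rng.random(Y_true.shape) < 0.3   # True where an activity was "measured"
Y = np.where(mask, Y_true, np.nan)
filled = iterative_impute(X, Y)
```

Because each assay is predicted from the current estimates of the other assays as well as from the descriptors, information measured in one assay can propagate to the unmeasured entries of another over successive iterations.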