Predicting Athletic Performance-enhancing Substances by Protein Target Prediction
Lazaros Mavridis1, John B.O. Mitchell1
1University of St Andrews
The World Anti-Doping Agency (WADA) publishes the Prohibited List, an international standard identifying substances and methods prohibited in-competition, out-of-competition and in particular sports. Ideally, one would like to identify, in a fast and cost-effective way, all substances that have one or more performance-enhancing pharmacological actions. Databases such as ChEMBL bring together in one place bioactivity data for hundreds of thousands of different molecules and for thousands of protein targets. When combined with information from sources such as DrugBank, these data can also be associated with specific biological and pharmacological activities. Molecules that have been investigated in different assays may have activities listed against more than one target. A known limitation of molecular bioactivity data is that not every compound has been experimentally assayed against every target, so the matrix of available molecule-target data is sparse. Chemoinformatics target prediction methods can fill these gaps with predicted data, allowing a molecule's bioactivity spectrum against the whole panel of targets to be assessed.
In this work, we use experimental data derived from the ChEMBL database (~7,000,000 activity records for 1,300,000 compounds) to build a database model that takes into account both structure and experimental information. The ChEMBL database was screened, and eight well-populated activity types (Ki, Kd, EC50, ED50, activity, potency, inhibition and IC50) were used in a rule-based filtering process to label each compound as “active” or “inactive”. Using only the active compounds for each ChEMBL family, a clustering step was performed to split families containing more than one distinct scaffold. This produced clustered bioactivity-based families, whose members share both a common chemical scaffold and a similar profile of bioactivity against targets in ChEMBL. We have used the Parzen-Rosenblatt window machine learning approach to predict membership of these clustered bioactivity-based families for compounds for which we have no experimental data. We hope that other groups will test these predictions experimentally in the future. Validation tests using compounds on the WADA Prohibited List show very good agreement with experimental data where these are available.
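The family-membership prediction described above can be illustrated with a minimal sketch of a Parzen-Rosenblatt window score. This is not the authors' implementation: the fingerprint representation (sets of on-bits), the Gaussian kernel over Tanimoto distance, the bandwidth `h`, the omission of the kernel's normalising constant, and all family names and toy data are assumptions made purely for illustration.

```python
import math

def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def parzen_score(query: frozenset, family: list, h: float = 0.3) -> float:
    """Unnormalised Parzen-Rosenblatt window estimate for one family.

    Averages a Gaussian kernel of the Tanimoto distance (1 - similarity)
    between the query and every known active member of the family.
    The bandwidth h is an assumed, untuned value.
    """
    if not family:
        return 0.0
    total = 0.0
    for member in family:
        distance = 1.0 - tanimoto(query, member)
        total += math.exp(-(distance ** 2) / (2.0 * h * h))
    return total / len(family)

def rank_families(query: frozenset, families: dict, h: float = 0.3) -> list:
    """Return (family name, score) pairs sorted from most to least likely."""
    scores = {name: parzen_score(query, members, h)
              for name, members in families.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: two families, each holding fingerprints of known actives.
families = {
    "family_A": [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5})],
    "family_B": [frozenset({10, 11, 12}), frozenset({10, 11, 13})],
}
query = frozenset({1, 2, 3, 6})  # shares three bits with each family_A member

ranked = rank_families(query, families)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

In this sketch a compound with no assay data is scored against every family, and the highest-scoring families are the candidates to prioritise for experimental testing. A real application would use chemical fingerprints computed from structures and a bandwidth chosen by validation.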
Our predictions include well-known associations between the anabolic steroids and the androgen receptors, as well as more subtle connections with bacterial and protozoan families (Mycobacterium tuberculosis and Trypanosoma brucei) that are supported by the literature. Furthermore, our refined classes allow compounds with different scaffolds to be predicted for the same ChEMBL family; we will show how two different kinds of cannabinoids are both predicted for the cannabinoid CB1/2 receptors even though their chemical structures are quite distinct.
Our work will allow early identification of potential doping molecules. These compounds can then be prioritised for experimental testing, while no further experiments need to be conducted on those with negative in silico predictions. The use of this computational technology will massively reduce the need for animal or human experiments. Our results can be interpreted as a quantitative definition of the “similar chemical structure” criterion, based on similar predicted protein-target interactions, which will prevent inactive molecules from being prohibited and hence protect athletes against unjust disqualification.