Epigenetic Target Fishing through Classification Models based on Statistical-Based Database Fingerprint Similarities
Norberto Sánchez Cruz1,2, Jordi Mestres2, José Luis Medina Franco1
1Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México
2Research Group on Systems Pharmacology, Research Program on Biomedical Informatics (GRIB), IMIM Hospital del Mar Medical Research Institute & University Pompeu Fabra, Parc de Recerca Biomèdica (PRBB)
Dysfunction of epigenetic regulation has been linked to the pathogenesis and progression of several diseases, including cancer, neurodegenerative disorders and diabetes [1–3]. For this reason, the proteins that modulate this process are promising targets for drug discovery. Although encouraging progress has been made over the last decade in the development of epi-probes, epigenetic drug discovery remains challenging [4]: so far, only seven epigenetic agents have been approved for human use [5]. Hence there is a need for methodologies that accelerate the epigenetic drug discovery process.
In this work we present the application of a recently developed approach, the Statistical-Based Database Fingerprint (SB-DFP) [6], to build single-fingerprint representations of 28 epigenetic data sets from the epigenomics database [7], and its implementation in target fishing campaigns based on target-specific classification models.
Keywords: similarity searching, target fishing, molecular fingerprints.
The epigenomics database used in this study contains epigenetic target associations for over 7800 unique compounds and 60 different targets. For our analysis, we selected the 28 targets for which there were at least 50 reported compounds with a potency of 10 µM or better. These compounds were labeled as “active”, while the remaining compounds were labeled as “inactive”. The selected targets include bromodomain-containing proteins (BRD2, BRD3 and BRD4), histone acetyltransferases (CREBBP and EP300), DNA methyltransferase (DNMT1), histone lysine methyltransferase (EHMT2), histone deacetylases (HDAC1-HDAC11), lysine acetyltransferase (KAT2B), lysine demethylases (KDM1A and KDM4C), histone methyl-lysine binding proteins (L3MBTL1 and L3MBTL3), mitogen-activated protein kinase (MAP3K7), O-GlcNAcase (MGEA5), nuclear receptor coactivators with histone acetyltransferase activity (NCOA1 and NCOA3), and protein arginine methyltransferase 1 (PRMT1).
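The labeling and target selection described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(compound_id, target, potency_um)` and the function name `select_targets` are hypothetical, a compound with several measurements for one target is treated as active if any of them meets the cutoff, and only compounds with a record for a given target are labeled for that target.

```python
from collections import defaultdict

POTENCY_CUTOFF_UM = 10.0  # "10 µM or better" is read here as potency <= 10 µM
MIN_ACTIVES = 50          # minimum number of actives for a target to be kept


def select_targets(records, cutoff_um=POTENCY_CUTOFF_UM, min_actives=MIN_ACTIVES):
    """Label compounds per target and keep targets with enough actives.

    records: iterable of (compound_id, target, potency_um) tuples
             (hypothetical layout, not the database's actual schema).
    Returns {target: {compound_id: is_active}} for the selected targets.
    """
    labels = defaultdict(dict)
    for compound_id, target, potency_um in records:
        is_active = potency_um <= cutoff_um
        # Assumption: any qualifying measurement makes the compound active.
        previous = labels[target].get(compound_id, False)
        labels[target][compound_id] = previous or is_active

    selected = {}
    for target, by_compound in labels.items():
        n_actives = sum(by_compound.values())
        if n_actives >= min_actives:
            selected[target] = dict(by_compound)
    return selected
```

With the defaults, a target is kept only if at least 50 of its compounds are labeled active; the thresholds are exposed as parameters so the sketch can be tried on small toy inputs.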
From the subset described, we selected the compounds associated with more than three epigenetic targets as the validation set for the target fishing model; the remaining compounds were used to train and test the target-specific similarity searching approaches.
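A minimal sketch of this partition is given below. It assumes that “associated with” a target means “labeled active against” it, which the text does not state explicitly; the function name `split_validation` and the input layout `{target: {compound_id: is_active}}` are hypothetical.

```python
from collections import Counter


def split_validation(labels_by_target, more_than=3):
    """Split compounds into a validation set and a train/test pool.

    labels_by_target: {target: {compound_id: is_active}} (hypothetical layout).
    A compound goes to the validation set when it is active against
    more than `more_than` targets (assumed reading of "associated with").
    Returns (validation_ids, remaining_ids) as two sets.
    """
    n_targets = Counter()
    all_compounds = set()
    for by_compound in labels_by_target.values():
        for compound_id, is_active in by_compound.items():
            all_compounds.add(compound_id)
            if is_active:
                n_targets[compound_id] += 1

    validation = {c for c in all_compounds if n_targets[c] > more_than}
    remaining = all_compounds - validation
    return validation, remaining
```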
The SB-DFP is a single fingerprint representation of a compound data set and can be constructed from virtually any molecular fingerprint. The approach consists of comparing the occurrence frequency of each bit of a molecular fingerprint in a target data set with its frequency in a reference set (representing the “known chemical space”): a bit is set to “1” in the SB-DFP only if its occurrence frequency in the target data set is statistically higher than in the reference set. In this work we used the Extended Connectivity Fingerprint with a bond diameter of 4 (ECFP4) as the molecular fingerprint, 15,403,690 compounds from the ZINC12 database as the reference set, and a Z-test for the comparison of occurrence frequencies.
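The bit-selection rule can be illustrated with the sketch below. It is not the published implementation: it assumes a one-sided, one-sample Z-test for proportions in which the reference frequency is treated as the population proportion, and the significance level `alpha=0.01` is an illustrative choice, not a value taken from this abstract. Fingerprints are represented as sets of on-bit indices, and the function name `sb_dfp` is hypothetical.

```python
import math
from statistics import NormalDist


def sb_dfp(fingerprints, ref_freq, alpha=0.01):
    """Build a database fingerprint as a set of on-bit indices.

    fingerprints: list of sets of on-bit indices, one per target-set compound.
    ref_freq:     list with the occurrence frequency of each bit in the
                  reference set (values in [0, 1]).
    alpha:        one-sided significance level (assumed, illustrative value).
    """
    n = len(fingerprints)
    if n == 0:
        raise ValueError("the target data set is empty")
    z_crit = NormalDist().inv_cdf(1.0 - alpha)

    # Count how many target-set compounds have each bit switched on.
    counts = [0] * len(ref_freq)
    for fp in fingerprints:
        for bit in fp:
            counts[bit] += 1

    on_bits = set()
    for bit, p0 in enumerate(ref_freq):
        p = counts[bit] / n
        std_err = math.sqrt(p0 * (1.0 - p0) / n)
        if std_err == 0.0:
            # Degenerate reference frequency (0 or 1): the z statistic is
            # undefined, so fall back to a plain comparison (assumption).
            if p > p0:
                on_bits.add(bit)
            continue
        z = (p - p0) / std_err
        if z > z_crit:
            on_bits.add(bit)
    return on_bits
```

In practice the input bit sets would come from a cheminformatics toolkit that computes ECFP4-type fingerprints; that step is left out here to keep the sketch self-contained.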
Each of the 28 epigenetic data sets was randomly divided 100 times into train and test sets with a 3:1 ratio. The active compounds in the train set were used to build the SB-DFP, and the similarity values (measured as the Tanimoto coefficient) between the compounds in this set and the SB-DFP were used to build different target-specific classification models. The test set is being used to evaluate the performance of the classification models in terms of the area under the receiver operating characteristic curve (AUC) and the enrichment factor at 1% screening. The best-performing models will be selected to build different target fishing models, and the compounds in the validation set will be used to evaluate the performance of these models in terms of the fraction of correctly predicted known targets. The results will be discussed in this presentation.
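The similarity measure and the two evaluation metrics mentioned above can be sketched as follows. The sketch ranks compounds directly by their Tanimoto similarity to the SB-DFP, which stands in for the unspecified classification models; the AUC is computed through its rank-based (Mann-Whitney) definition, and rounding the top 1% up to at least one compound is an assumption. All function names are hypothetical.

```python
import math


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as on-bit sets."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def roc_auc(scores, labels):
    """AUC as the probability that a random active outscores a random
    inactive, with ties counted as one half."""
    actives = [s for s, is_active in zip(scores, labels) if is_active]
    inactives = [s for s, is_active in zip(scores, labels) if not is_active]
    if not actives or not inactives:
        raise ValueError("both actives and inactives are required")
    wins = sum((a > i) + 0.5 * (a == i) for a in actives for i in inactives)
    return wins / (len(actives) * len(inactives))


def enrichment_factor(scores, labels, fraction=0.01):
    """Ratio between the active rate in the top-scored fraction and the
    active rate in the whole set."""
    n = len(scores)
    n_actives = sum(1 for is_active in labels if is_active)
    if n == 0 or n_actives == 0:
        raise ValueError("at least one active compound is required")
    n_top = max(1, math.ceil(fraction * n))  # assumption: round up, min 1
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(1 for _, is_active in ranked[:n_top] if is_active)
    return (hits_top / n_top) / (n_actives / n)
```

For a test set, `scores` would be the list `[tanimoto(fp, sb_dfp_bits) for fp in test_fingerprints]`, where `sb_dfp_bits` and `test_fingerprints` are placeholder names for the database fingerprint and the test-set fingerprints.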
1. Sarkar, Heerboth S, Lapinska K, et al (2014) Use of Epigenetic Drugs in Disease: An Overview. Genet Epigenet 1:9. https://doi.org/10.4137/GEG.S12270
2. Dawson MA, Kouzarides T (2012) Cancer Epigenetics: From Mechanism to Therapy. Cell 150:12–27. https://doi.org/10.1016/j.cell.2012.06.013
3. Hwang J-Y, Aromolaran KA, Zukin RS (2017) The emerging field of epigenetics in neurodegeneration and neuroprotection. Nat Rev Neurosci 18:347–361. https://doi.org/10.1038/nrn.2017.46
4. Shortt J, Ott CJ, Johnstone RW, Bradner JE (2017) A chemical probe toolbox for dissecting the cancer epigenome. Nat Rev Cancer 17:160–183. https://doi.org/10.1038/nrc.2016.148
5. Lu W, Zhang R, Jiang H, et al (2018) Computer-Aided Drug Design in Epigenetics. Front Chem 6:. https://doi.org/10.3389/fchem.2018.00057
6. Sánchez-Cruz N, Medina-Franco JL (2018) Statistical-based database fingerprint: chemical space dependent representation of compound databases. J Cheminform 10:55. https://doi.org/10.1186/s13321-018-0311-x
7. Naveja JJ, Medina-Franco JL (2017) Insights from pharmacological similarity of epigenetic targets in epipolypharmacology. Drug Discov Today. https://doi.org/10.1016/j.drudis.2017.10.006