Further Steps Towards Predicting Bioactivity by Traversing Knowledge Graphs
Terence Egbelo1, Vlad Sykora2, Michael Bodkin2, Ziqi Zhang1 and Val Gillet1
1 Information School, University of Sheffield, The Wave, 2 Whitham Rd, Sheffield S10 2AH, United Kingdom
2 Insilico R&D, Evotec (UK) Ltd, 114 Park Drive, Abingdon OX14 4RZ, United Kingdom
A biomedical knowledge graph is a heterogeneous information network integrating the relationships between entities such as genes, proteins, compounds and diseases. A variety of properties relevant to drug discovery are encoded as direct links between entities, for example chemical similarities between pairs of druglike compounds. These properties may correlate with more complex patterns within the graph such as the tendency of similar compounds to have similar biological effects.
This poster summarises preliminary work to tackle the prediction of compound bioactivity in protein assays as a knowledge graph completion problem. Here, knowledge graph completion is a classification task where an Active or Inactive label is to be assigned to each prospective compound-assay pairing. The objective is thus to learn classifier models that can separate Compound-Active-Assay knowledge graph “triples” from Compound-Inactive-Assay triples and generalise to correctly infer activity and inactivity in unseen compound-assay pairings.
The data set used is an extract of Evotec’s comprehensive proprietary biomedical knowledge graph created by integrating public data sources from the areas of proteomics, chemistry and pharmacology. Data from Ensembl (Gene and Protein nodes), ChEMBL (Compound, Assay and Measurement nodes), the Experimental Factor Ontology (Disease nodes) and the Gene Ontology (Biological Process nodes), among others, have been merged into a single graph. The extract graph used in this workflow represents experimental data from a set of kinase assays.
Inspired by previous research (Lao et al., 2011; Fu et al., 2016; Himmelstein et al., 2017), the approach used in this study leverages observable topological properties of the knowledge graph to tackle the knowledge graph completion problem. In particular, each Compound-Active-Assay or Compound-Inactive-Assay triple in the data set is characterised by a feature vector whose elements are the counts of a set of distinct path types (“metapaths”) that connect the compound and the assay in the sample knowledge graph. The resulting feature matrix allows the training and evaluation of classifier models.
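As an illustration of the metapath-count features described above, the following is a minimal sketch on a toy graph, not the workflow used in this study. The node identifiers, the edges, the two metapaths and the function names are all hypothetical. The sketch counts type-matching walks, which may revisit nodes, so it may differ from the path definition used in the actual feature generation.

```python
from collections import defaultdict

# Hypothetical toy graph: node types and undirected edges (illustrative only).
NODE_TYPE = {
    "c1": "Compound", "c2": "Compound",
    "p1": "Protein", "p2": "Protein",
    "a1": "Assay",
}
EDGES = [
    ("c1", "c2"),                              # e.g. a chemical similarity link
    ("c1", "p1"), ("c2", "p1"), ("c2", "p2"),  # compound-protein links
    ("p1", "a1"), ("p2", "a1"),                # protein-assay links
]

# Hypothetical metapaths, written as sequences of node types.
METAPATHS = [
    ("Compound", "Protein", "Assay"),
    ("Compound", "Compound", "Protein", "Assay"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def count_metapath(adj, node_type, source, target, metapath):
    """Count walks from source to target whose node types follow metapath."""
    if node_type[source] != metapath[0]:
        return 0
    # frontier maps each reached node to the number of matching walks ending there
    frontier = {source: 1}
    for wanted in metapath[1:]:
        reached = defaultdict(int)
        for node, n_walks in frontier.items():
            for neighbour in adj[node]:
                if node_type[neighbour] == wanted:
                    reached[neighbour] += n_walks
        frontier = reached
    return frontier.get(target, 0)


def feature_vector(adj, node_type, compound, assay, metapaths):
    """One metapath count per feature for a compound-assay pairing."""
    return [count_metapath(adj, node_type, compound, assay, mp) for mp in metapaths]


ADJ = build_adjacency(EDGES)
print(feature_vector(ADJ, NODE_TYPE, "c1", "a1", METAPATHS))  # [1, 2]
print(feature_vector(ADJ, NODE_TYPE, "c2", "a1", METAPATHS))  # [2, 1]
```

The toy graph deliberately contains no direct compound-assay edge. In a real setting, the Active or Inactive link being predicted would presumably need to be excluded when counting paths, so that the label does not leak into the features.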
In this poster, we describe the process of generating feature vectors from the knowledge graph that form the inputs for the training, validation and testing of Random Forest models. We then explore:
- Train-test splitting policies in the knowledge graph machine learning context
- The impact of the amount of data available; specifically, how model performance changes based on the availability of data from many vs few separate assays
- The tendency, within this set of ChEMBL kinase assays, of lower IC50 / EC50 measurements to appear in assays with many measurements and higher IC50 / EC50 measurements to appear in assays with few measurements
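To make the first point above concrete, the sketch below contrasts two possible train-test splitting policies for compound-assay pairings. It is an assumption-laden illustration: this abstract does not specify which policies are compared, and the pairings, function names and the choice to group by compound are hypothetical.

```python
import random

# Hypothetical compound-assay pairings (illustrative only).
PAIRS = [
    ("c1", "a1"), ("c1", "a2"),
    ("c2", "a1"),
    ("c3", "a2"),
    ("c4", "a1"), ("c4", "a2"),
]


def random_split(pairs, test_fraction=0.25, seed=0):
    """Split pairings at random; a compound may appear on both sides."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def compound_grouped_split(pairs, test_fraction=0.25, seed=0):
    """Hold out whole compounds, so no test compound is seen in training."""
    rng = random.Random(seed)
    compounds = sorted({compound for compound, _ in pairs})
    rng.shuffle(compounds)
    n_test = max(1, round(len(compounds) * test_fraction))
    held_out = set(compounds[:n_test])
    train = [pair for pair in pairs if pair[0] not in held_out]
    test = [pair for pair in pairs if pair[0] in held_out]
    return train, test


train, test = compound_grouped_split(PAIRS)
train_compounds = {compound for compound, _ in train}
test_compounds = {compound for compound, _ in test}
print(train_compounds.isdisjoint(test_compounds))  # True
```

Under a random split, paths through a compound's other measured assays can connect training and test pairings. A grouped split removes that overlap at the level of the chosen entity, here the compound; grouping by assay instead would follow the same pattern.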
Future work will focus on validation of the metapath approach on an additional, more balanced extract of the complete knowledge graph.
References
Lao, N., Mitchell, T., & Cohen, W. (2011, July). Random walk inference and learning in a large scale knowledge base. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 529-539).
Fu, G., Ding, Y., Seal, A., Chen, B., Sun, Y., & Bolton, E. (2016). Predicting drug target interactions using meta-path-based semantic network analysis. BMC Bioinformatics, 17(1), 1-10.
Himmelstein, D. S., Lizee, A., Hessler, C., Brueggeman, L., Chen, S. L., Hadley, D., … & Baranzini, S. E. (2017). Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 6, e26726.
Galárraga, L. A., Teflioudi, C., Hose, K., & Suchanek, F. (2013, May). AMIE: association rule mining under incomplete evidence in ontological knowledge bases. In Proceedings of the 22nd International Conference on World Wide Web (pp. 413-422).
Horn, A. (1951). On sentences which are true of direct unions of algebras. The Journal of Symbolic Logic, 16(1), 14-21.