Identifying Drug Repositioning Candidates using Representation Learning on Heterogeneous Networks
Charles Tapley Hoyt1,2, Daniel Domingo-Fernández1,2, Mehdi Ali2, Rana Aldisi1,2, Lingling Xu1,2, Martin Hofmann-Apitius1,2
1Department of Bioinformatics, Fraunhofer SCAI, Sankt Augustin 53754, Germany
2Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn 53115, Germany
Because of the high costs and high failure rate of drug discovery and development, drug repositioning (DR) is an attractive alternative in which previously approved drugs can be ascribed new uses. Previous computational approaches have often focused on two subtasks: predicting new drug-target interactions and predicting new target-disease associations as an indirect means of identifying novel drug-disease associations. First, public chemogenomics databases such as ChEMBL and ExCAPE-DB typically support the exploration of chemical space and training of classifiers for compound activities against specific proteins. Second, public gene-disease and target-disease databases such as DisGeNET and Open Targets support the prediction of new target-disease associations. Alternatively, target prioritization methods combining disease-specific experimental data with complete interactomes have also generated new target-disease association predictions.
While several methodologies have built on these core ideas to integrate information about drugs’ side effects, drug-induced expression profiles in various cell lines or models, and the functionality of proteins, they are typically bespoke and based on linear algebra that is rigid to changes in dimensionality, which ultimately makes the integration of new data intractable. Knowledge graphs (KGs) present a solution in which new data can be integrated readily and a variety of methods can be applied, regardless of the schema. A prominent example of a highly heterogeneous KG used in DR is from Himmelstein et al. (2017), in which the authors engineered and subsequently selected topological features based on weighted path counts. However, feature engineering is also burdensome because of the time needed to generate and interpret the features.
An alternative is to use network representation learning (NRL) methods, which aim to generate, for a given KG, low-dimensional vector representations of nodes whose elements correspond to latent features of the KG. Here, we propose and evaluate three classes of NRL methods for the link prediction task in DR using the previously mentioned data sets: 1) random walk-based methods (e.g., DeepWalk, Node2Vec, GAT2VEC), 2) translational methods (e.g., TransE, TransD, TransH), and 3) convolutional neural network-based methods (e.g., ConvE). Finally, we compare and contrast the different classes and methods of NRL used for DR.
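To make the link prediction setting concrete, the following is a minimal sketch of how a translational model such as TransE scores a candidate drug-disease edge: a triple (head, relation, tail) is considered plausible when the head embedding plus the relation embedding lies close to the tail embedding. The entity names, the relation name `treats`, and the embedding values are hypothetical and hand-picked for illustration; they are not learned from any of the data sets mentioned above, and the helper functions are not taken from any particular library.

```python
import math

# Toy 2-dimensional embeddings. These values are hypothetical and hand-picked
# for illustration; a real model would learn them from the knowledge graph.
entity_embeddings = {
    "drug_a": [0.0, 0.0],
    "disease_x": [1.0, 0.0],
    "disease_y": [0.0, 3.0],
}
relation_embeddings = {
    "treats": [1.0, 0.0],
}


def transe_score(head, relation, tail):
    """Return -||h + r - t|| (Euclidean norm); higher means more plausible."""
    h = entity_embeddings[head]
    r = relation_embeddings[relation]
    t = entity_embeddings[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation, candidates):
    """Sort candidate tail entities from most to least plausible."""
    return sorted(
        candidates,
        key=lambda tail: transe_score(head, relation, tail),
        reverse=True,
    )


# Rank candidate diseases for a drug under the (hypothetical) "treats" relation.
ranking = rank_tails("drug_a", "treats", ["disease_y", "disease_x"])
```

In this toy setting, `disease_x` is ranked above `disease_y` because `drug_a + treats` coincides with its embedding. Other translational models change how the relation is applied to the entities, but the ranking step is analogous.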
The concept of proteochemometrics has recently been introduced to describe the use of both chemical-chemical and protein-protein similarity to simultaneously predict chemical-target activities across previously challenging target and chemical landscapes. We propose NRL methods as tools to accomplish the goals of proteochemometrics with the added ability to easily incorporate heterogeneous information.
We also consider some of the drawbacks of using KGs and knowledge graph embedding (KGE) models for link prediction. While proteochemometric approaches rely on the continuous range of chemical and protein similarities (arising from topological fingerprints, high-throughput screening fingerprints, learned embeddings, etc.), KGs must use a cutoff to represent them as discrete edges. We conclude by proposing methodological improvements that might be able to leverage the rich information afforded by these similarity metrics.
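The information loss caused by such a cutoff can be illustrated with a short sketch. It assumes, purely for illustration, that chemicals are described by fingerprints stored as sets of "on" bit positions and compared with the Tanimoto coefficient; the chemical names, fingerprint contents, the `similar_to` edge label, and the cutoff values are all hypothetical.

```python
from itertools import combinations

# Hypothetical fingerprints, each stored as the set of "on" bit positions.
fingerprints = {
    "chem_1": {1, 2, 3, 4},
    "chem_2": {1, 2, 3, 5},
    "chem_3": {7, 8},
}


def tanimoto(a, b):
    """Tanimoto coefficient of two bit sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(fps, cutoff):
    """Keep a discrete 'similar_to' edge for each pair at or above the cutoff."""
    edges = []
    for left, right in combinations(sorted(fps), 2):
        if tanimoto(fps[left], fps[right]) >= cutoff:
            edges.append((left, "similar_to", right))
    return edges


# The continuous similarity between chem_1 and chem_2 is 3/5 = 0.6.
edges_loose = similarity_edges(fingerprints, cutoff=0.5)
edges_strict = similarity_edges(fingerprints, cutoff=0.7)
```

With a cutoff of 0.5 the pair `chem_1`/`chem_2` becomes a single unweighted edge, and with a cutoff of 0.7 it disappears entirely; in both cases the actual similarity value of 0.6 is no longer available to the downstream model.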
References
Himmelstein, D. S., et al. (2017). Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 6. https://doi.org/10.7554/eLife.26726