Towards an intuitive measure of similarity in medicinal chemistry space using graph edit cost
Wing Ki Wong1, Charlotte M. Deane1, Roger A. Sayle2, William R. Pitt3
1Department of Statistics, University of Oxford, Oxford OX1 3LB, U.K.
2NextMove Software Ltd., Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge, CB4 0EY, U.K.
3UCB, 208 Bath Road, Slough, SL1 3WE, U.K.
Molecules can be represented by graphs. Maximum common subgraphs (MCS) have been used to infer substructure matches between molecules, hence serve as a metric of chemical similarity. Based on the concept of maximum common edge subgraph, the SmallWorld  algorithm uses the graph edit distance (GED) to quantify chemical similarity. In this work, we extend this functionality by exploring the possibility of incorporating an edit cost matrix and compare the performance against a standard SmallWorld search.
GED is the minimum cost of edits required to transform one molecule into another . It can be considered as an analogue to the String Edit Distance widely used in language processing  and biological sequence comparison . Based on the frequency of edit operations, such as insertion/deletion (indel) and substitution of characters between two strings, an edit cost is associated with each operation to indicate the reward or penalty for undergoing the path of transformation. In bioinformatics, substitution matrices for amino acids, including PAM  and BLOSUM , have been derived from the observations of the sequences across closely-related proteins. Similar weighted edit cost may also be applied in chemoinformatics, where conservative edits that preserve bioisosterism are considered favourably over drastic chemical changes. Substitution patterns have been observed in molecular fragment databases  and matched molecular pair analysis . These analyses revealed that certain transformations can be over-represented depending on the bias in the data set.
In this work, we survey the common edit operations explored by the SmallWorld algorithm, and attempt to parameterise the edit cost of the GED algorithm. Datasets from the public domain and from UCB’s medicinal chemistry programs are used for this purpose. Our overall aim is to produce similarity measures that concur with medicinal chemists’ intuition. This poster details our progress towards this goal.
 Sayle, R. A., Batista, J., & Grant, J. A. (2013). Efficient maximum common subgraph (MCS) searching of large chemical databases. Journal of Cheminformatics, 5(1), O15.
 Sanfeliu, A., & Fu, K. S. (1983). A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, (3), 353-362.
 Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), 171-176.
 Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.
 Dayhoff, M. O. (1972). A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5, 89-99.
 Henikoff, S., & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, 89(22), 10915-10919.
 Hall, R. J., Murray, C. W., & Verdonk, M. L. (2017). The Fragment Network: A Chemistry Recommendation Engine Built Using a Graph Database. Journal of Medicinal Chemistry, 60(14), 6440-6450.
 Tyrchan, C., & Evertsson, E. (2017). Matched molecular pair analysis in short: algorithms, applications and limitations. Computational and Structural Biotechnology Journal, 15, 86-90.