Noel O’Boyle Abstract

A medicinal chemistry based measure of R group similarity

Noel M. O’Boyle¹, Roger A. Sayle¹

¹NextMove Software

Molecular similarity is one of the most central concepts in chemoinformatics. Typical measures of molecular similarity (such as the Tanimoto coefficient of binary fingerprints) are used for tasks such as similarity searching, distinguishing similar molecules from dissimilar ones (e.g. identifying actives in a virtual screen), measuring the diversity of a dataset, or selecting a diverse subset.
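As a concrete illustration of the Tanimoto coefficient mentioned above, the following minimal sketch treats a binary fingerprint as the set of its on-bit indices. The function name and the toy bit sets are illustrative assumptions, not taken from any particular toolkit, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two binary fingerprints,
    each given as a collection of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # assumed convention for two empty fingerprints
    shared = len(a & b)
    # shared bits divided by bits set in either fingerprint
    return shared / (len(a) + len(b) - shared)


# Toy fingerprints (made-up bit indices, for illustration only)
query = {1, 4, 7, 9}
candidate = {1, 4, 8}
score = tanimoto(query, candidate)  # 2 shared bits out of 5 distinct bits
```

Ranking a set of molecules by similarity to a query then amounts to sorting them by this score in descending order.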

Recently we compared different molecular fingerprints in the context of molecular similarity by testing how well they ranked molecules by similarity to a query [1]. We found that the results differed quite significantly depending on whether the molecules to be ranked were highly similar or more diverse; the widely-used ECFP4 fingerprint performed best for the more diverse set, but surprisingly was outperformed by atom pair fingerprints when ranking highly similar molecules. In that study, ‘highly similar molecules’ were drawn from the same ChEMBL assay and were often matched molecular pairs (molecules that differed only in the identity of an R group). This suggests that when dealing with similarity between matched molecular pairs (or equivalently, R groups, if we ignore the specific environment), methods developed to measure global structural similarity are not the appropriate choice.

Improved methods to measure R group similarity are of particular importance in the context of a medicinal chemistry project. These often proceed by changing one R group at a time, advancing through matched pairs. Given an appropriate measure of R group similarity, it should be possible to suggest relevant modifications or identify gaps in the project data that should be filled. In a computational context, R group enumeration could be used to generate relevant candidate molecules for virtual screening or purchase.

In medicinal chemistry, the term bioisosteric replacement refers to a substitution that retains broadly similar biological properties. However, there is a need to go beyond the binary classification of bioisosteres/non-bioisosteres to handle levels of similarity; for example, chloro is regarded as more similar to fluoro than to amino. Clearly, in this context, the extent to which two R groups are similar is not just (or not even) a question of shared substructures. Previous approaches to this problem include the R group descriptors of Holliday et al. [2], which map atom-based descriptor values onto a vector by distance from the attachment point, and the use of reduced graphs to encode bioisosteric equivalences by Birchall et al. [3].
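The idea of mapping atom-based values onto a vector by distance from the attachment point can be sketched roughly as follows. This is only a schematic reading of that one-sentence description, not the published method of Holliday et al.: the graph representation, the function name, and the per-atom values are all hypothetical.

```python
from collections import deque


def distance_binned_descriptor(adjacency, atom_values, attachment, max_distance):
    """Sum per-atom values into bins indexed by topological (bond-count)
    distance from the attachment atom of an R group."""
    # Breadth-first search gives the shortest bond-count distance to each atom
    dist = {attachment: 0}
    queue = deque([attachment])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)

    vector = [0.0] * (max_distance + 1)
    for atom, d in dist.items():
        if d <= max_distance:  # atoms further away are ignored
            vector[d] += atom_values[atom]
    return vector


# Toy isopropyl group: atom 0 is attached to the scaffold, atoms 1 and 2 are methyls.
# The per-atom value of 1.0 is a placeholder (effectively a heavy-atom count).
adjacency = {0: [1, 2], 1: [0], 2: [0]}
atom_values = {0: 1.0, 1: 1.0, 2: 1.0}
isopropyl = distance_binned_descriptor(adjacency, atom_values, attachment=0, max_distance=2)
```

Two R groups could then be compared by any vector similarity over such descriptors, so that similarity depends on where properties sit relative to the attachment point rather than on shared substructures.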

We propose to use co-occurrence in medicinal chemistry project data to derive a measure of R group similarity. We will use two distinct sources for these data, both in the public domain. The first is the ChEMBL database, which contains assay data extracted from a range of medicinal chemistry journals. The other source is the US patent literature, which provides text and ChemDraw sketches from which a large amount of medicinal chemistry data can be extracted [4]. While large-scale mining of medicinal chemistry data has previously been used to detect bioisosteres (see for example [5]), to our knowledge this is the first time it has been used to develop a method to measure R group similarity.
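One simple way such a co-occurrence measure might be realised is sketched below. The abstract does not specify the scoring function, so everything here is an assumption: each project (for example an assay or a patent) is reduced to the set of R groups observed in it, and two R groups are scored by the Jaccard overlap of the projects in which they appear. The data are made up for illustration.

```python
from collections import defaultdict


def cooccurrence_similarity(projects):
    """Build a similarity function from a mapping of
    project id -> set of R groups observed in that project."""
    seen_in = defaultdict(set)  # R group -> ids of projects containing it
    for project_id, r_groups in projects.items():
        for r_group in r_groups:
            seen_in[r_group].add(project_id)

    def similarity(r1, r2):
        a = seen_in.get(r1, set())
        b = seen_in.get(r2, set())
        union = a | b
        # Jaccard overlap of the two project sets (assumed scoring choice)
        return len(a & b) / len(union) if union else 0.0

    return similarity


# Hypothetical toy data, not drawn from ChEMBL or patents
projects = {
    "series_A": {"F", "Cl", "Br"},
    "series_B": {"F", "Cl", "NH2"},
    "series_C": {"F", "Cl"},
    "series_D": {"NH2", "OH"},
}
sim = cooccurrence_similarity(projects)
chloro_fluoro = sim("Cl", "F")
chloro_amino = sim("Cl", "NH2")
```

In this toy data chloro and fluoro appear in the same three series while chloro and amino share only one, so chloro scores as more similar to fluoro than to amino, mirroring the example given earlier. A real implementation would likely need to normalise for how common each R group is overall.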

1. O’Boyle NM, Sayle RA (2016) Comparing structural fingerprints using a literature-based similarity benchmark. Journal of Cheminformatics 8:36.
2. Holliday JD, Jelfs SP, Willett P, Gedeck P (2003) Calculation of Intersubstituent Similarity Using R-Group Descriptors. J Chem Inf Comput Sci 43:406–411.
3. Birchall K, Gillet VJ, Willett P, et al (2009) Use of Reduced Graphs To Encode Bioisosterism for Similarity-Based Virtual Screening. J Chem Inf Model 49:1330–1346.
4. Mayfield JW (2016) Sketchy Sketches: Hiding Chemistry in Plain Sight. Poster presented at the 7th Joint Sheffield Conference on Chemoinformatics, Sheffield.
5. Wassermann AM, Bajorath J (2011) Large-scale exploration of bioisosteric replacements on the basis of matched molecular pairs. Future Medicinal Chemistry 3:425–436.