Abstract Details


Poster 21: Ligand-Based Virtual Screening and Visualization in the SMILES Chemical Space

Julian Schwartz¹, Jean-Louis Reymond¹
¹Department of Chemistry and Biochemistry, University of Berne, Freiestrasse 3, 3012 Berne, Switzerland
To perform in silico similarity searches in virtual chemical databases, molecules must be represented by a descriptor set that encodes selected properties of the molecule. Although a broad variety of descriptor sets exists, and an unlimited variety is conceivable, the descriptors most commonly used for similarity searches are based on substructural fragments. Such descriptors are usually represented as a fixed-length bit-string in which each bit encodes either the presence (1) or the absence (0) of a certain substructure or fragment within the molecule. The Tanimoto coefficient, which is typically used as the similarity coefficient for binary fingerprints, calculates the similarity between two molecules as the ratio of the number of bits “on” in both molecules to the number of bits “on” in either molecule[1].
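
The Tanimoto coefficient described above can be sketched in a few lines of Python. This is a minimal illustration only: the two 8-bit fingerprints are made up, and returning 1.0 for two all-zero fingerprints is a convention chosen here, not something the abstract specifies.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length bit-strings made of '0' and '1'."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits that are "on" in both molecules.
    both = sum(1 for a, b in zip(fp_a, fp_b) if a == "1" and b == "1")
    # Bits that are "on" in at least one of the two molecules.
    either = sum(1 for a, b in zip(fp_a, fp_b) if a == "1" or b == "1")
    # Assumption: two empty fingerprints are treated as identical.
    return both / either if either else 1.0


# Made-up fingerprints for illustration.
fp_1 = "11010010"
fp_2 = "10010110"
similarity = tanimoto(fp_1, fp_2)  # 3 shared bits / 5 bits set in either = 0.6
```

The coefficient ranges from 0 (no bits in common) to 1 (identical sets of "on" bits).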

We recently introduced the Molecular Quantum Number (MQN)[2] classification for organic molecules, a 42-dimensional scalar descriptor set that counts atom types and bond types as well as polarity and topological features of molecules, and which serves as one example of a scalar molecular fingerprint.

Herein, we present and demonstrate the usefulness of a related 34-dimensional scalar descriptor set that can be readily derived from the SMILES[3] representation of molecules. Our SMILES fingerprint (SMIfp) is generated by counting the occurrences of 34 distinct characters used in the SMILES representation; its generation is therefore extremely fast and does not require any sophisticated chemical interpretation software. In combination with the city-block distance as the distance metric, the SMIfp not only allows large databases such as GDB-13[4] (977 million molecules) to be searched and screened rapidly, but also significantly enriches annotated bioactive compounds from the Directory of Useful Decoys[5].
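
The idea of a character-count fingerprint compared by city-block distance can be illustrated with the following sketch. It is not the authors' implementation: the abstract does not list the 34 characters, so `ALPHABET` below is a shortened, hypothetical character set, and the helper names (`char_count_fp`, `city_block`, `nearest`) and the tiny database are invented for the example.

```python
from collections import Counter

# Hypothetical, shortened alphabet for illustration only; the abstract does
# not say which 34 SMILES characters the SMIfp actually counts.
ALPHABET = "CNOSFcnos()[]=#12+-@H"


def char_count_fp(smiles):
    """Count how often each character of ALPHABET occurs in a SMILES string."""
    counts = Counter(smiles)
    return [counts[ch] for ch in ALPHABET]


def city_block(fp_a, fp_b):
    """City-block (Manhattan) distance: sum of absolute count differences."""
    return sum(abs(a - b) for a, b in zip(fp_a, fp_b))


def nearest(query, database, k=2):
    """Return the k database SMILES closest to the query in fingerprint space."""
    q = char_count_fp(query)
    return sorted(database, key=lambda s: city_block(q, char_count_fp(s)))[:k]


query = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
database = [
    "OC(=O)c1ccccc1O",       # salicylic acid
    "CCO",                   # ethanol
    "c1ccccc1",              # benzene
    "CC(=O)Nc1ccc(O)cc1",    # paracetamol
]
hits = nearest(query, database)
```

Because the fingerprint needs only string counting, no chemistry toolkit is involved, which is the speed argument made in the text. A full-database search would stream over the entries rather than sort them in memory as this toy example does.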

To demonstrate the use of the SMIfp, 15 different commercial drugs were used as queries. For each drug, a set of 10,000 compounds was randomly selected from the 0.01% of GDB-13 most similar to the drug by SMIfp and analyzed for shape similarity to the parent drug using ROCS. All sets show, on average, higher shape similarity to the parent drug than a random selection of GDB-13 molecules, suggesting that some of the retrieved molecules might also share biological activities with the parent drugs.

In contrast to binary fingerprints, an n-dimensional scalar descriptor set can be interpreted as spanning an n-dimensional chemical space in which molecules with similar descriptors, and thus similar structures and presumably similar biological activities, cluster in the same region. We show that the resulting SMILES chemical space is chemically meaningful and can be visually inspected as a two-dimensional projection after application of Principal Component Analysis (PCA), as exemplified for large databases such as GDB-13 or PubChem[6].
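
A two-dimensional PCA projection of count vectors can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the poster: the descriptor rows are hypothetical 4-dimensional counts rather than real 34-dimensional SMIfp values, and the principal components are obtained by simple power iteration with deflation so that the example needs only the Python standard library.

```python
import math

# Hypothetical count vectors, one row per molecule (not real SMIfp values).
descriptors = [
    [2, 0, 1, 0],
    [3, 0, 1, 0],
    [6, 0, 0, 2],
    [7, 1, 0, 2],
    [6, 2, 1, 2],
    [1, 1, 2, 0],
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the column mean from every descriptor."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def dominant_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector by power iteration."""
    v = [float(i + 1) for i in range(len(matrix))]  # arbitrary start vector
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in matrix])
    return eigenvalue, v


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    var1, pc1 = dominant_eigenpair(cov)
    # Deflation: remove the first component, then find the next largest.
    d = len(cov)
    deflated = [[cov[i][j] - var1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    var2, pc2 = dominant_eigenpair(deflated)
    coords = [(dot(r, pc1), dot(r, pc2)) for r in centered]
    return coords, (var1, var2), (pc1, pc2)


coords, variances, components = pca_2d(descriptors)
```

Each entry of `coords` is an (x, y) position that could be passed to any plotting tool; for databases of the size mentioned in the text, a numerical library would be the practical choice over this pure-Python version.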

[1] P. Willett, J. M. Barnard and G. M. Downs, J. Chem. Inf. Comput. Sci. 1998, 38, 983-996.
[2] K. T. Nguyen, L. C. Blum, R. van Deursen, J.-L. Reymond, ChemMedChem 2009, 4, 1803-1805.
[3] D. Weininger, J. Chem. Inf. Comput. Sci. 1988, 28 (1), 31-36.
[4] L. C. Blum and J.-L. Reymond, J. Am. Chem. Soc. 2009, 131, 8732.
[5] N. Huang, B. K. Shoichet and J. J. Irwin, J. Med. Chem. 2006, 49 (23), 6789-6801.
[6] Y. Wang, J. Xiao, T. O. Suzek, J. Zhang, J. Wang, S. H. Bryant, Nucleic Acids Res. 2009, 37, W623-W633.
