Connected Subgraph Fingerprint: From Theory to Applications
Louis Bellmann1,2, Matthias Rarey1,2
1Universität Hamburg
2ZBH – Center for Bioinformatics
Molecular fingerprints are an efficient and widely applied representation of compounds used in similarity-driven virtual screening and the clustering of compound data sets. A molecule is represented by a set of structural features which are in turn identified with a bit in a bit string or an integer in a vector. Due to their compactness they allow fast comparison of chemical structures. Typically fingerprint methods consider only structural features belonging to a specific family of structures such as path-like, circular or other topologically predefined structures. With modern machine learning techniques and increased computational power it might be beneficial for certain application scenarios to overcome these limitations.
We present the “Connected Subgraph Fingerprint” (CSFP) which derives a set of integers as a representation for chemical structures in four steps. First all distinct connected chemical substructures within an upper atom count limit are enumerated. We present an efficient algorithm following a specific enumeration order to avoid duplicates. In the second step, for each heavy atom contained in a substructure an identifier is derived based on certain atom properties such as chemical element, valence state and connectivity either only within the substructure or the whole compound. In the third step, all atom identifiers are combined to form a unique identifier, representing the complete substructure. To achieve this, we follow the CANGEN [1] procedure using atom identifiers to generate the canonical ordering. After duplicate removal all substructure identifiers are combined to form the fingerprint of the chemical structure where each identifier appears only once in the resulting representation.
The described method is easily adaptable to different application scenarios by applying filters during the enumeration process, choosing alternative atom properties or integrating stereo chemistry.
For evaluation we use an open-source benchmark platform [2] together with the Maximum Unbiased Validation (MUV) data sets [3] and compare different variations of the method to the Extended Connectivity Fingerprint (ECFP) [4]. The experiments on these data sets show that the Connected Subgraph Fingerprint can compete with and in some cases even surpass the performance of the ECFP when predicting biological activity by chemical similarity.
Compared to standard fingerprints like ECFP, the Connected Subgraph Fingerprint is a general approach covering the space of substructure features completely. Besides pure performance values, the new approach has several interesting advantages. The feature space is more complete and easily controllable which makes it an ideal input for machine learning methods working in high dimensions. Furthermore, the approach is compatible to combinatorial fragment spaces [5] opening the way to topological search in chemical space.
1. D. Weininger, A. Weininger, J. L. Weininger (1989) SMILES. 2. Algorithm for Generation of Unique SMILES Notation. J. Chem. Inf. Comput. Sci. 29(2):97-1001
2. S. Riniker, G. A. Landrum (2013) Open-source platform to benchmark fingerprints for ligand-based virtual screening. J. Chem. Inf. Model. 5(1):26
3. S. G. Rohrer, K. Baumann (2009) Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data. J. Chem. Inf. Model. 49(2):169-184
4. D. Rogers, M. Hahn (2010) Extended-connectivity fingerprints. J. Chem. Inf. Model. 50(5):742-754
5. F. Lauck, M. Rarey (2013) Coping with Combinatorial Space in Molecular Design. De novo Molecular Design, Chapter 14:325-347