Abstract Details

Validation and Standardization of Molecular Structures in General and Sugars in Particular: a Case Study

Colin Batchelor (1), Ken Karapetyan (2), Valery Tkachenko (2), Antony Williams (2)
(1) Royal Society of Chemistry, Thomas Graham House, Cambridge, UK CB4 0WF
(2) Royal Society of Chemistry, Wake Forest, NC USA
A key challenge in managing large molecular databases, particularly ChemSpider, is dealing with variation in how chemical structures are depicted and reconciling discrepancies between data sources. The InChI project has gone a long way towards addressing this challenge, but it is not a complete solution for many tasks, particularly biomedical applications where the tautomeric forms found under physiological conditions are important.

To this end we present the Chemical Validation and Standardization Platform (CVSP). CVSP uses a wide range of cheminformatics algorithms to check chemical structures for plausibility and to canonicalize them. The default ruleset is based closely on the US Food and Drug Administration standardization rules, which address matters such as metal connectivity and the preferred ionization state of molecules with multiple ionizable groups. We also check for drawing artefacts such as apparent ethane and cyclobutane molecules used to depict wavy lines and boxes. The ruleset is fully configurable by users. We also provide web service nodes for graphical programming tools such as KNIME.
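As an illustration of what a configurable check for drawing artefacts might look like, the following is a minimal sketch in Python. It is not CVSP's code or API: the `Fragment` class, the rule list and the `validate` function are hypothetical names, and the sketch assumes a record has already been split into disconnected fragments. It flags a lone ethane or cyclobutane fragment only when it appears alongside other fragments, on the assumption that it was then drawn as a line or a box rather than as a molecule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Fragment:
    """One disconnected component of a drawn record (hypothetical, simplified)."""
    elements: Tuple[str, ...]               # heavy-atom symbols, e.g. ("C", "C")
    bonds: Tuple[Tuple[int, int, int], ...]  # (atom index, atom index, bond order)


def is_ethane(frag: Fragment) -> bool:
    """Two carbons joined by one single bond."""
    return (frag.elements == ("C", "C")
            and len(frag.bonds) == 1
            and frag.bonds[0][2] == 1)


def is_cyclobutane(frag: Fragment) -> bool:
    """Four carbons, four distinct single bonds, every atom of degree two."""
    if frag.elements != ("C",) * 4 or len(frag.bonds) != 4:
        return False
    if any(order != 1 for _, _, order in frag.bonds):
        return False
    if len({frozenset((i, j)) for i, j, _ in frag.bonds}) != 4:
        return False  # duplicate or self-referencing bonds
    degree = [0] * 4
    for i, j, _ in frag.bonds:
        degree[i] += 1
        degree[j] += 1
    return degree == [2, 2, 2, 2]


# The ruleset is plain data, so a user could add, remove or reorder rules.
Rule = Tuple[str, Callable[[Fragment], bool]]
DEFAULT_RULES: List[Rule] = [
    ("apparent ethane (possible drawn line)", is_ethane),
    ("apparent cyclobutane (possible drawn box)", is_cyclobutane),
]


def validate(fragments: List[Fragment], rules: List[Rule] = DEFAULT_RULES):
    """Return (fragment index, message) pairs for suspected drawing artefacts."""
    issues = []
    if len(fragments) < 2:
        return issues  # a record that is only ethane is taken at face value
    for index, frag in enumerate(fragments):
        for message, predicate in rules:
            if predicate(frag):
                issues.append((index, message))
    return issues
```

Keeping the rules as a list of named predicates is one simple way to make a ruleset configurable; a real platform would also need rules for the chemistry itself, such as metal connectivity and ionization state, which this sketch does not attempt.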

An important respect in which we go beyond the FDA rules is the problem of perspective depictions of sugar rings. Algorithms such as the InChI code, which rely solely on the depiction of bonds as wedges, straight lines or hashes, lose the stereochemical information that distinguishes the different anomers of glucose, L-sugars from D-sugars, or indeed any hexopyranose from any other.

Any code that handles such depictions needs to implement the following steps. First, it must identify the conformation of a given hexagon depicted in 2D, whether regular, boat, chair or some other form; we present a taxonomy of hexagons and an algorithm to distinguish them. Next, it needs to determine whether the ring has been depicted in a perspective conformation. Once this is established, regular hexagons do not need to be redrawn, but boat and chair hexagons do. Given the well-rehearsed challenges of laying out chemical structures in the plane, it is important to reconstruct the sugar ring in a way that minimizes disruption to the rest of the molecule. The last stage is to identify whether, in crude terms, each substituent points “up” or “down” with respect to the real-life ring, and to assign the bonds as wedges, hashes or straight lines accordingly, paying due attention to both IUPAC’s draft recommendations on depicting chemical structures and the stereo interpretation code in common cheminformatics libraries.
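The first step can be illustrated with a minimal geometric sketch in Python. This is not the taxonomy or algorithm presented in the talk: it only separates regular hexagons, which need no redrawing, from everything else, and the function names and tolerance values are assumptions chosen for illustration. It treats a ring as regular when it is convex, all six sides are close to the mean side length, and all six interior angles are close to 120 degrees.

```python
import math


def side_lengths(points):
    """Lengths of the edges of a closed polygon given as a list of (x, y)."""
    n = len(points)
    return [math.dist(points[i], points[(i + 1) % n]) for i in range(n)]


def is_convex(points):
    """True if the polygon turns the same way at every vertex."""
    n = len(points)
    signs = set()
    for i in range(n):
        ax, ay = points[i - 1]
        bx, by = points[i]
        cx, cy = points[(i + 1) % n]
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross != 0:
            signs.add(cross > 0)
    return len(signs) == 1


def interior_angles(points):
    """Angle in degrees at each vertex between its two ring bonds."""
    n = len(points)
    angles = []
    for i in range(n):
        ax, ay = points[i - 1]
        bx, by = points[i]
        cx, cy = points[(i + 1) % n]
        v1 = (ax - bx, ay - by)
        v2 = (cx - bx, cy - by)
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        cosine = dot / (math.hypot(*v1) * math.hypot(*v2))
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cosine)))))
    return angles


def is_regular_hexagon(points, side_tol=0.1, angle_tol=10.0):
    """Crude regularity test; tolerances are illustrative assumptions."""
    if len(points) != 6:
        raise ValueError("expected the six ring atoms in ring order")
    sides = side_lengths(points)
    mean = sum(sides) / 6
    if mean == 0:
        return False
    if any(abs(s - mean) / mean > side_tol for s in sides):
        return False
    if not is_convex(points):
        return False
    return all(abs(a - 120.0) <= angle_tol for a in interior_angles(points))
```

A ring that fails this test is only a candidate for a perspective depiction; telling a chair from a boat from a badly drawn regular ring needs the finer classification described above.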

We have also applied the algorithm described above, with a small number of modifications, to Haworth-style depictions of five-membered rings.
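The up/down assignment for a Haworth-style ring, of either size, can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it assumes 2D coordinates with y increasing upwards, that substituent bonds in a Haworth depiction are drawn roughly vertically, and that the redrawn flat ring is viewed from above, so that "up" maps to a wedge and "down" to a hash. The function names and the `min_vertical_fraction` threshold are hypothetical.

```python
import math

# Mapping assumes the redrawn ring is viewed from above (an assumption).
BOND_STYLE = {"up": "wedge", "down": "hash", "undetermined": "plain"}


def substituent_direction(ring_atom, substituent, min_vertical_fraction=0.5):
    """Classify a substituent bond as 'up', 'down' or 'undetermined'.

    ring_atom and substituent are (x, y) positions; a bond counts as vertical
    enough when |dy| is at least min_vertical_fraction of its length.
    """
    dx = substituent[0] - ring_atom[0]
    dy = substituent[1] - ring_atom[1]
    length = math.hypot(dx, dy)
    if length == 0 or abs(dy) / length < min_vertical_fraction:
        return "undetermined"
    return "up" if dy > 0 else "down"


def assign_bond_styles(substituents):
    """Map each label to a bond style.

    substituents: dict of label -> (ring_atom_xy, substituent_xy).
    """
    return {
        label: BOND_STYLE[substituent_direction(ring_xy, sub_xy)]
        for label, (ring_xy, sub_xy) in substituents.items()
    }
```

Nothing in the sketch depends on the ring size, which is consistent with reusing the same idea for five-membered rings; what it leaves out is the choice of which end of the bond carries the wedge and the handling of ring atoms with two substituents, both of which matter for correct stereo interpretation.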

We present quantitative results of validation on data sets such as DrugBank, the Merck Index data now published by the Royal Society of Chemistry, and ChemSpider itself.

We show how code from the platform has been deployed to filter the results of name-to-structure conversion algorithms, in an ambitious project to text-mine the entire Royal Society of Chemistry journal and book archive.

Lastly, with an eye to the Semantic Web, we discuss methods for representing cheminformatics calculations in RDF, using existing ontologies where possible, and show how the results of both physicochemical property prediction and CVSP’s processing can be disseminated as nanopublications in RDF as part of the pan-European Open PHACTS project.
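To make the nanopublication idea concrete, here is a minimal sketch in Python that serializes one predicted property value as N-Quads, with separate named graphs for the head, assertion, provenance and publication information. It is not the representation used by CVSP or Open PHACTS. The `http://example.org/cvsp-sketch/` namespace, the `predictedLogP` property and the function names are hypothetical placeholders; a real deployment would replace them with terms from existing chemistry ontologies. Only the RDF, XSD, PROV and nanopublication schema terms are taken from published vocabularies.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NP = "http://www.nanopub.org/nschema#"
PROV = "http://www.w3.org/ns/prov#"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"
EX = "http://example.org/cvsp-sketch/"  # hypothetical namespace


def iri(value):
    """Wrap an IRI in angle brackets for N-Quads."""
    return f"<{value}>"


def literal(value, datatype):
    """Format a typed literal."""
    return f'"{value}"^^<{datatype}>'


def quad(subject, predicate, obj, graph):
    """Format one N-Quads statement."""
    return f"{subject} {predicate} {obj} {graph} ."


def property_nanopub(pub_id, compound_iri, property_iri, value, activity_iri):
    """Return the N-Quads lines for one property-value nanopublication."""
    base = f"{EX}{pub_id}"
    pub = iri(base)
    head = iri(f"{base}#Head")
    assertion = iri(f"{base}#assertion")
    provenance = iri(f"{base}#provenance")
    pubinfo = iri(f"{base}#pubinfo")
    return [
        # Head graph: ties the three parts together.
        quad(pub, iri(RDF_TYPE), iri(NP + "Nanopublication"), head),
        quad(pub, iri(NP + "hasAssertion"), assertion, head),
        quad(pub, iri(NP + "hasProvenance"), provenance, head),
        quad(pub, iri(NP + "hasPublicationInfo"), pubinfo, head),
        # Assertion graph: the calculated value itself.
        quad(iri(compound_iri), iri(property_iri),
             literal(value, XSD_DOUBLE), assertion),
        # Provenance graph: which calculation produced the assertion.
        quad(assertion, iri(PROV + "wasGeneratedBy"), iri(activity_iri), provenance),
        # Publication info graph: who published it (placeholder agent).
        quad(pub, iri(PROV + "wasAttributedTo"), iri(EX + "pipeline"), pubinfo),
    ]


if __name__ == "__main__":
    for line in property_nanopub(
        "np1", EX + "compound/1", EX + "predictedLogP", 1.23, EX + "calculation/1"
    ):
        print(line)
```

Separating the assertion from its provenance is what lets a consumer decide how far to trust a predicted value, since the calculation that produced it is addressable in its own right.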
