Poster 27: OPSIN: Taming the Jungle of IUPAC Chemical Nomenclature
Daniel M. Lowe1,2, Peter Murray-Rust1, Robert C. Glen1
1Unilever Centre for Molecular Science Informatics, Department of Chemistry, Lensfield Road, Cambridge CB2 1EW, UK
2NextMove Software Ltd, Innovation Centre Unit 23, Science Park, Milton Road, Cambridge CB4 0EY, UK
Chemical names are a common way of communicating chemical structure information. They broadly fall into three classes: trivial, semi-systematic and systematic. A trivial name cannot be broken down into smaller portions of text that still retain structural meaning; its structure can be determined only if the mapping from name to structure is present in a dictionary. A systematic name can be fully derived using a set of rules such as those provided by IUPAC or CAS (Chemical Abstracts Service), whilst a semi-systematic name has a trivial name at its core that has been modified systematically. For example, caffeine (trivial name) may also be named 1,3,7-trimethylxanthine (semi-systematic) or 1,3,7-trimethyl-1H-purine-2,6(3H,7H)-dione (systematic).
OPSIN (Open Parser for Systematic IUPAC Nomenclature) is a freely available, open-source program for converting chemical names, especially those that are systematic in nature, to chemical structures. The software is available as a Java library, a command-line interface and a web service. OPSIN accepts names that conform to either IUPAC or CAS nomenclature and can convert them to SMILES, InChI and CML (Chemical Markup Language).
Chemical name-to-structure conversion facilitates structure searching of the chemical literature and can be used in combination with natural language processing techniques to extract chemical reactions. OPSIN is already employed in large-scale text mining applications such as SureChem and in entity resolvers such as the Chemical Identifier Resolver.
OPSIN employs a regular grammar that describes IUPAC nomenclature and uses it to break chemical names into their constituent parts and to assign meanings to those parts. For example, ethene is broken down into ‘eth’ and ‘ene’, with the former meaning the creation of a chain of two carbons and the latter indicating the presence of a double bond. Next, the parts of the name are grouped together and nested according to the brackets in the name to yield a parse tree. To generate the connection table, nomenclature operations are applied successively to this tree, with an in-memory connection table being modified to reflect each operation. Finally, the structure is serialised to SMILES, InChI or CML.
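The stages described above (tokenising with a regular grammar, building a connection table, serialising) can be illustrated with a toy sketch in Python. This is not OPSIN's implementation: OPSIN is a Java library with a far larger grammar, and the lexicon, the function names and the restriction to unbranched chains with at most one multiple bond, assumed to sit at position 1, are all assumptions made purely for illustration.

```python
import re

# Toy lexicon (hypothetical; not OPSIN's token files).
STEMS = {"meth": 1, "eth": 2, "prop": 3, "but": 4, "pent": 5, "hex": 6}
# Bond order that each suffix places on the first bond of the chain.
SUFFIXES = {"ane": 1, "ene": 2, "yne": 3}

# A regular grammar for this tiny subset: exactly one stem, then one suffix.
NAME_PATTERN = re.compile(
    "(?P<stem>" + "|".join(STEMS) + ")(?P<suffix>" + "|".join(SUFFIXES) + ")"
)


def tokenise(name):
    """Split a name such as 'ethene' into its parts ('eth', 'ene')."""
    match = NAME_PATTERN.fullmatch(name.strip().lower())
    if match is None:
        raise ValueError(f"name not covered by the toy grammar: {name!r}")
    return match.group("stem"), match.group("suffix")


def build_connection_table(stem, suffix):
    """Apply the meaning of each token to an in-memory connection table."""
    chain_length = STEMS[stem]
    bond_order = SUFFIXES[suffix]
    if bond_order > 1 and chain_length < 2:
        raise ValueError("a multiple bond needs at least two carbons")
    atoms = ["C"] * chain_length
    # Start from a fully saturated chain: single bonds between neighbours.
    bonds = [(i, i + 1, 1) for i in range(chain_length - 1)]
    if bond_order > 1:
        # Assumption: with no locant given, the multiple bond is at position 1.
        bonds[0] = (0, 1, bond_order)
    return atoms, bonds


def to_smiles(atoms, bonds):
    """Serialise an unbranched chain to SMILES (hydrogens left implicit)."""
    symbols = {1: "", 2: "=", 3: "#"}
    smiles = atoms[0]
    for _, end, order in bonds:
        smiles += symbols[order] + atoms[end]
    return smiles


def name_to_smiles(name):
    stem, suffix = tokenise(name)
    atoms, bonds = build_connection_table(stem, suffix)
    return to_smiles(atoms, bonds)


print(name_to_smiles("ethene"))   # C=C
print(name_to_smiles("propane"))  # CCC
```

Names outside the toy grammar, such as caffeine, raise a ValueError; this mirrors the point that trivial names need a dictionary lookup rather than rule-based parsing.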
OPSIN has grown from covering only simple general organic chemical nomenclature to the point of having competent coverage of all areas of organic chemical nomenclature. One of the most recent additions is comprehensive support for the nomenclature of carbohydrates. This brings support for dialdoses, diketoses, ketoaldoses, alditols, aldonic acids, uronic acids, aldaric acids, glycosides and oligosaccharides, in both open-chain and cyclic forms, named either systematically or from trivial sugar stems, with support for modification terms such as anhydro or deoxy.
OPSIN’s support for specialised and general organic nomenclature will be demonstrated through illustrative examples and accompanying performance metrics. We focus in particular on areas of nomenclature for which support was recently added and those that are complex to implement such as fused ring nomenclature.
(1) Lowe, D. M.; Corbett, P. T.; Murray-Rust, P.; Glen, R. C. Chemical Name to Structure: OPSIN, an Open Source Solution. J. Chem. Inf. Model. 2011, 51, 739–753.
(2) OPSIN Source Code and Downloads on Bitbucket. http://bitbucket.org/dan2097/opsin/
(3) Lowe, D. M. OPSIN Web Service. http://opsin.ch.cam.ac.uk/
(4) Lowe, D. M. Automated Extraction of Reactions from the Patent Literature. CINF#75, 243rd ACS National Meeting & Exposition, San Diego, CA, March 27, 2012.
(5) Digital Science. SureChem. https://surechem.com/
(6) Sitzmann, M. NCI/CADD Chemical Identifier Resolver. http://cactus.nci.nih.gov/chemical/structure