Abstract Details


Poster 22: Macromolecules or Big Small-Molecules? Handling Biopolymers in a Chemical Registry System

Noel M O'Boyle¹, Evan Bolton², Roger A Sayle¹
¹NextMove Software Ltd, Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge, CB4 0EY, UK
²National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda MD 20894, USA

In 2012 the best-selling drug in the US was adalimumab, a TNF inhibitor used to treat rheumatoid arthritis and other inflammatory diseases. Rather than being a traditional small-molecule drug, it is a protein (specifically a monoclonal antibody) with a molecular weight of ~140 kDa. It exemplifies the increasing importance of biologics in the pharmaceutical industry. The question then arises: what is the best way to store the structural information of biologics in chemical registry systems designed for small molecules?

Existing systems and their associated toolkits provide a mature set of tools to handle small-molecule data, from generating depictions to creating and reading linear representations (such as SMILES and InChI). However, such tools do not translate well to the domain of biopolymers, where the key information is the identity of the repeating units and the nature of the connections between them. For example, a typical all-atom 2D depiction of all but the smallest protein or oligosaccharide obscures this key structural information.

One approach to solving this problem is to develop a new macromolecule representation that can efficiently capture the components of a macromolecule and the intercomponent linkages. For example, a team at Pfizer have developed the HELM language [1] and associated graphical tools to enter and display this information for macromolecules, which may contain any of peptides, oligosaccharides, oligonucleotides and arbitrary chemical groups attached to each other. However, the drawback of such an approach is that it requires either an update of the existing chemical registry system or the establishment of a separate system that can handle the new data structure. Furthermore, there may be problems storing molecules that do not exactly fit the macromolecule representation.

We propose an alternative approach: macromolecules should simply be treated in the same way as small molecules, and stored in the same chemical registry systems as connection tables (or SMILES). Efficient perception routines should then be used to interconvert between this all-atom representation and the various reduced-atom representations already established for proteins, oligosaccharides [2] and so forth. This approach means that no additional infrastructure is required in terms of the chemical registry system, and that all of the existing in-house or toolkit functionality developed for small molecules (e.g. substructure searching) will be retained for macromolecules.

We describe a suite of tools which allows seamless interconversion between appropriate structure representations for small molecules and biopolymers (with a focus on polypeptides and oligosaccharides). For example:
SMILES: OC[C@H]1O[C@@H](O[C@@H]2[C@@H](CO)OC([C@@H]([C@H]2O)NC(=O)C)O)[C@@H]([C@H]([C@H]1O)O[C@@]1(C[C@H](O)[C@H]([C@@H](O1)[C@@H]([C@@H](CO)O)O)NC(=O)C)C(=O)O)O
Condensed IUPAC format: NeuAc(a2-3)Gal(b1-4)GlcNAc
LINUCS format: [][D-GlcpNAc]{[(4+1)][b-D-Galp]{[(3+2)][a-D-NeupAc]{}}}
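
To make the relationship between the two reduced representations above concrete, here is a minimal Python sketch that rewrites an unbranched condensed IUPAC string as LINUCS. It is not the tool suite described in this abstract: the function name `condensed_to_linucs` and the `RESIDUE_NAMES` lookup table are hypothetical, the table holds only the three residues from the example, and branching, modifications and unknown attachment points are not handled.

```python
import re

# Hypothetical lookup table (an assumption for illustration): condensed IUPAC
# residue names mapped to the LINUCS residue names used in the example above.
RESIDUE_NAMES = {
    "NeuAc": "D-NeupAc",
    "Gal": "D-Galp",
    "GlcNAc": "D-GlcpNAc",
}

# A linkage such as "(a2-3)": anomer, carbon on the child, carbon on the parent.
LINKAGE = re.compile(r"\(([ab])(\d)-(\d)\)")


def condensed_to_linucs(condensed: str) -> str:
    """Rewrite an unbranched condensed IUPAC string as LINUCS (sketch only)."""
    # Splitting on a pattern with three groups yields
    # [residue, anomer, child_pos, parent_pos, residue, ...].
    parts = LINKAGE.split(condensed)
    residues = parts[0::4]
    links = [tuple(parts[i:i + 3]) for i in range(1, len(parts), 4)]

    # Condensed IUPAC lists the non-reducing end first, while LINUCS starts at
    # the reducing end, so nest the residues from the outermost one inwards.
    nested = ""
    for child, (anomer, child_pos, parent_pos) in zip(residues, links):
        nested = (
            f"[({parent_pos}+{child_pos})]"
            f"[{anomer}-{RESIDUE_NAMES[child]}]"
            f"{{{nested}}}"
        )

    # The root residue (reducing end) has an empty linkage field.
    return f"[][{RESIDUE_NAMES[residues[-1]]}]{{{nested}}}"


if __name__ == "__main__":
    print(condensed_to_linucs("NeuAc(a2-3)Gal(b1-4)GlcNAc"))
```

Applied to the condensed IUPAC string above, the sketch reproduces the LINUCS string shown. A lookup table is used here only to keep the example short; the approach proposed in this abstract perceives the residues from the all-atom structure instead.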

We discuss the challenges of supporting a variety of biopolymer representations, handling chemically modified structures, and handling biopolymers with unknown attachment points (e.g. from mass spectrometry).

References:
1. Zhang T, Li H, Xi H, Stanton RV, Rotstein SH. HELM: A Hierarchical Notation Language for Complex Biomolecule Structure Representation. J. Chem. Inf. Model. 2012, 52, 2796-2806.
2. Bohne-Lang A, Lang E, Förster T, von der Lieth CW. LINUCS: Linear Notation for Unique description of Carbohydrate Sequences. Carbohydr. Res. 2001, 336, 1-11.
