Oral Presentation Abstracts


Ultrafast Shape Recognition for Similarity Search in Molecular Databases

Pedro J. Ballester and W. Graham Richards
Physical & Theoretical Chemistry Laboratory, University of Oxford, South Parks Road, Oxford OX1 3QZ, UK.

Molecular databases are routinely screened for compounds that most closely resemble a molecule of known biological activity to provide novel drug leads. It is widely believed that 3D molecular shape is the most discriminating pattern for biological activity, as it is directly related to the steep repulsive part of the interaction potential between the drug-like molecule and its macromolecular target. However, efficient comparison of molecular shape is currently a challenge. Here we show that a new approach [1,2] based on moments of distance distributions is able to recognise molecular shape at least three orders of magnitude faster than current methodologies. Such an ultrafast method permits the identification of similarly shaped compounds within the largest molecular databases. In addition, the problematic requirement of aligning molecules for comparison is circumvented, as the proposed distributions are independent of molecular orientation.
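
Because the descriptor uses only internal distances, it can be computed without any alignment. The following is a minimal sketch, assuming the formulation of refs. 1-2 (distances of all atoms to four reference points, summarized by their first three moments, and compared by an inverse Manhattan-distance score); the function names are illustrative.

```python
import numpy as np

def usr_descriptor(coords):
    """coords: (n_atoms, 3) array of 3D atomic coordinates."""
    ctd = coords.mean(axis=0)                    # molecular centroid
    d_ctd = np.linalg.norm(coords - ctd, axis=1)
    cst = coords[d_ctd.argmin()]                 # atom closest to the centroid
    fct = coords[d_ctd.argmax()]                 # atom farthest from the centroid
    d_fct = np.linalg.norm(coords - fct, axis=1)
    ftf = coords[d_fct.argmax()]                 # atom farthest from fct
    desc = []
    for ref in (ctd, cst, fct, ftf):
        d = np.linalg.norm(coords - ref, axis=1)
        mu = d.mean()
        # first three moments of the distance distribution
        desc += [mu, ((d - mu) ** 2).mean(), ((d - mu) ** 3).mean()]
    return np.array(desc)                        # 12 numbers per molecule

def usr_similarity(q, t):
    """Inverse Manhattan-distance score in (0, 1]; 1 = identical descriptors."""
    return 1.0 / (1.0 + np.abs(q - t).mean())
```

Since only distances enter the calculation, the twelve resulting numbers are invariant to rotation and translation, which is what removes the alignment step.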

  1. Ballester, P.J. and Richards, W.G.: Ultrafast Shape Recognition to Search Compound Databases for Similar Molecular Shapes, Journal of Computational Chemistry (In Press)
  2. Ballester, P.J. and Richards, W.G.: Ultrafast Shape Recognition for Similarity Search in Molecular Databases, Proceedings of the Royal Society A (In Press)

Comparing Computational Approaches Over Multiple Datasets

Anthony Nicholls
OpenEye Scientific Software, 3600 Cerrillos Rd., Suite 1107, Santa Fe, NM 87507, USA

The evaluation of computational methods consumes considerable resources at both public and private organizations. Even so, the sharing and communication of the outcomes is often imprecise and statistically unsupported. I will present an approach, derived from ROC curves, that is informative, statistically sound and applicable to a wide variety of studies, in particular virtual screening.
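
By way of illustration of the underlying statistics (the talk's specific derivation is not reproduced here), a virtual screen can be summarized by the ROC AUC together with its Hanley-McNeil standard error, so that two methods are compared with error bars rather than bare numbers:

```python
import numpy as np

def roc_auc(scores_actives, scores_decoys):
    """AUC = probability that a random active outscores a random decoy
    (Mann-Whitney formulation; ties count one half)."""
    a = np.asarray(scores_actives)[:, None]
    d = np.asarray(scores_decoys)[None, :]
    return ((a > d).sum() + 0.5 * (a == d).sum()) / (a.size * d.size)

def hanley_mcneil_se(auc, n_act, n_dec):
    """Standard error of the AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1 - auc)
           + (n_act - 1) * (q1 - auc ** 2)
           + (n_dec - 1) * (q2 - auc ** 2)) / (n_act * n_dec)
    return var ** 0.5
```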


Virtual Screening Enrichment Studies: A Help or Hindrance in Tool Selection?

Andrew C. Good1 and Tudor Oprea2
1Bristol-Myers Squibb, Wallingford, CT 06492, USA
2Sunset Molecular Discovery LLC, 1704 B Llano Street, Suite 140, Santa Fe, New Mexico 87505, USA

The literature is now awash with enrichment studies that attempt to provide insights into the relative merits of the myriad available virtual screening tools. Unfortunately, suboptimal experimental design (e.g. poor target selection, limited active data set diversity, analogue-biased enrichment scores and unimaginative application of screening tools) compromises the utility of many of these studies. Issues of this nature are highlighted across a number of published target data sets and tool comparisons. Suggestions are made regarding how to mitigate such problems going forward, including the distribution of a new public data set derived from the WOMBAT database to aid direct head-to-head comparison.


HYDE: Towards an Integrated Description of Hydrogen Bonding and Dehydration

Ingo Reulecke1, Gudrun Lange2, Robert Klein2 and Matthias Rarey1
1Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, D-20146 Hamburg, Germany
2Bayer CropScience GmbH, Scientific Computing, Industriepark Hoechst, G836, D-65926 Frankfurt am Main, Germany

Recently published reviews [1-3] emphasize that the prediction of binding affinity remains a major concern in virtual drug design. To address this problem, we have developed a new empirical scoring function for the evaluation of protein-ligand complexes (HYDE). HYDE estimates the free energy of binding based on only two terms, dehydration and hydrogen bonding. The essential feature of this scoring function is the integrated use of accessibility-dependent partial logP increments. These are used to calculate the free energy of dehydration of a given inhibitor increment and also to estimate the contribution of polar increments towards an H-bond. While hydrophobic atoms contribute favorably when they are dehydrated, the dehydration of polar groups is initially penalized. This energy loss is compensated by the formation of an interfacial H-bond which has, according to our theory, a higher energy than the H-bond which a polar group forms to the water network.
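
As a toy restatement of this bookkeeping (explicitly not the published HYDE parameterization; all constants and field names below are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class AtomEnv:
    logp_increment: float    # partial logP increment (positive = hydrophobic)
    buried_fraction: float   # accessibility lost on binding, in [0, 1]
    is_polar: bool
    forms_interfacial_hbond: bool

def hyde_like_score(atoms, k_dehydration=1.0,
                    e_hbond_interface=-2.0, e_hbond_water=-1.5):
    """More negative = better predicted binding (arbitrary units)."""
    dg = 0.0
    for a in atoms:
        # dehydration: favorable for hydrophobic increments (positive logP),
        # unfavorable for polar increments (negative logP)
        dg += -k_dehydration * a.logp_increment * a.buried_fraction
        if a.is_polar and a.buried_fraction > 0:
            dg += -e_hbond_water * a.buried_fraction   # pay for the lost water H-bond
            if a.forms_interfacial_hbond:
                dg += e_hbond_interface                # regain (more) at the interface
    return dg
```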

First results show that the number of incorrect poses and false positive hits is dramatically reduced which leads among other things to a significantly improved enrichment in virtual screening.

  1. Leach, A. R.; Shoichet, B. K.; Peishoff, C. E., Prediction of protein-ligand interactions. Docking and scoring: successes and gaps. J Med Chem 2006, 49, (20), 5851-5.
  2. Sousa, S. F.; Fernandes, P. A.; Ramos, M. J., Protein-ligand docking: current status and future challenges. Proteins 2006, 65, (1), 15-26.
  3. Rarey, M.; Degen, J.; Reulecke, I., Docking and Scoring for Structure-Based Drug Design. In Bioinformatics - From Genomes to Therapies, Lengauer, T., Ed. Wiley-VCH: Weinheim, 2007; Vol. 2.

Molecular Docking for Substrate Identification: Lessons Learnt from the Family of Short-chain Dehydrogenases/Reductases

Angelo Favia, Irene Nobeli and Janet M. Thornton1
1European Bioinformatics Institute – EMBL, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
2Randall Division of Cell and Molecular Biophysics, New Hunt’s House, Guy’s Campus, King’s College London, London SE1 1UL, UK

§These authors have contributed equally to this work.

As structural genomics initiatives contribute to an ever-increasing number of protein structures, we now have access to many proteins for which the three-dimensional arrangement of atoms is known but the biochemical function is either totally or partially unknown. The functional information that can be derived from the amino acid sequence alone is limited, and hence structure-based prediction of the function of a protein is currently a hot topic, and one that is being addressed by many bioinformatics groups across the globe.

Molecular docking, the process of identifying and ranking the binding poses of a ligand to a protein, is a technique that has long been established as a useful tool in drug design and lead identification. It is generally acknowledged that modern docking programs can help with the identification of good binders, even though they have had considerably less success in predicting actual binding energies across a variety of protein families. Although widely used in the search for inhibitors, docking has more recently been investigated as a tool for protein function identification. More specifically, recent publications have claimed considerable success in identifying both known and unknown substrates of proteins [1,2]. In our previous work, we found it considerably harder to identify a protein's substrate when cross-docking a large number of enzymes and their cognate ligands [3]. Hence, we decided to revisit this problem and examine the use of docking in identifying the substrates of a single protein family with a remarkable substrate diversity, the short-chain dehydrogenases/reductases (SDRs).

Here, we examine different protocols for identifying candidate substrates for 27 SDR proteins of known catalytic function. We present the results of docking a) the cognate substrates and products of these proteins, b) approximately 900 metabolites from the human metabolome, and c) the whole of the KEGG Ligand database to each of these proteins. More specifically, we examine the ability of docking to a) reproduce a viable binding mode for the substrate, b) rank the substrate highly among the dataset of other metabolites, and c) provide us with information about the nature of the substrate, based on the best-scoring metabolites in the dataset. We compare two different docking methods, and two alternative scoring functions for one of the docking methods, and attempt to rationalise both the successful and the failed cases. Finally, we introduce a new protocol whereby we dock only a set of representative structures to each of the proteins, in the hope of reducing the computational cost of docking a very large number of metabolites to each binding site. We compare the results from this protocol to our original docking experiments and find some promising results regarding the use of structural representatives, as opposed to very large datasets.
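
A minimal sketch of the per-enzyme evaluation just described, with invented function and variable names, might look as follows:

```python
def evaluate_enzyme(scores, pose_rmsd, substrate_id,
                    rmsd_cutoff=2.0, top_fraction=0.01):
    """scores: {metabolite_id: docking score, lower = better};
    pose_rmsd: {metabolite_id: RMSD (A) of best pose vs. reference pose}."""
    ranked = sorted(scores, key=scores.get)            # best score first
    rank = ranked.index(substrate_id) + 1
    return {
        "substrate_rank": rank,
        "in_top_fraction": rank <= max(1, int(top_fraction * len(ranked))),
        "pose_viable": pose_rmsd[substrate_id] <= rmsd_cutoff,
        "top_scorers": ranked[:10],    # basis for inferring the substrate class
    }
```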

  1. Hermann J. C., Ghanem E., Li Y., Raushel F.M., Irwin J.J. et al. (2006) Predicting substrates by docking high-energy intermediates to enzyme structures. J.Am.Chem.Soc. 128(49): 15882-15891.
  2. Kalyanaraman C., Bernacki K., Jacobson M.P. (2005) Virtual screening against highly charged active sites: identifying substrates of alpha-beta barrel enzymes. Biochemistry 44(6): 2059-2071.
  3. Macchiarulo A., Nobeli I., Thornton J.M. (2004) Ligand selectivity and competition between enzymes in silico. Nat. Biotechnol. 22(8): 1039-1045.

The Use of Protein-ligand Interaction Motifs in Virtual Screening

Suzanne Brewerton
Astex Therapeutics Ltd., 436 Cambridge Science Park, Milton Road, Cambridge CB4 0QA, UK

The advent of high throughput crystallography means that drug discovery projects are generating structural data for large numbers of protein-ligand complexes. The growing databases of target-inhibitor complexes are an enormous repository of information that can feed back into future structure-based drug design to inform virtual screening and selectivity profiling experiments. Several groups have recently developed fingerprint approaches to express protein-ligand interaction data which can then be easily visualised and analysed simultaneously for large numbers of complexes. At Astex we have taken a different approach based on the observation that some combinations of interactions are seen often, while other combinations are infrequent. I will describe the analysis of these interactions and how they can be used to identify separate interaction motifs, rather than a single generalised fingerprint. The GOLD docking program has been modified to allow the use of several of these binding motifs simultaneously to guide the docking. A web-based interface for the analysis of interaction motifs using AstexViewer™ has been developed. We will demonstrate how these methods can be used in virtual screening experiments against several drug discovery targets.
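
As a hedged sketch of the underlying idea (the Astex implementation is not described here), each complex can be encoded as a set of interaction labels, and frequently co-occurring combinations counted as candidate motifs:

```python
from collections import Counter
from itertools import combinations

def motif_counts(fingerprints, motif_size=3):
    """fingerprints: one frozenset of interaction labels per complex,
    e.g. {'ASP86:hbond', 'PHE82:hydrophobic', ...}."""
    counts = Counter()
    for fp in fingerprints:
        for combo in combinations(sorted(fp), motif_size):
            counts[combo] += 1
    return counts   # frequent combinations are candidate binding motifs
```

Motifs seen in a large fraction of a target's complexes could then each be turned into a docking constraint, rather than one averaged fingerprint.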


Towards Understanding and Eliminating False Positives in Virtual Screening

Keith Davies
Treweren Consultants Ltd., Holmleigh, Evesham Road, Harvington, Evesham, WR11 8LU, UK

There are many programs which can dock molecules into receptor sites, and as many scoring functions. For 80-90% of molecules that can be folded into the receptor site, most software incorrectly predicts inhibition. The approximations in the scoring functions are usually blamed. Our analysis of a subset of the 68 billion drug-like molecules and nearly 300 protein targets screened in the Find-a-Drug project suggests that improving the docking algorithms and gaining a better insight into the interactions likely to be associated with inhibition delivers substantially better results. Most scientists would expect consideration of side-chain motion to help, but we were surprised to discover, for a diverse series of targets, that inhibitors appear to occupy an 'active volume' subset of the receptor site. Molecules that utilise space outside this 'active volume' do not exhibit activity despite being predicted to do so. This paper reports examples in which up to 100% of false positives are eliminated.
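
One illustrative reading of such an 'active volume' filter (the Find-a-Drug implementation is not specified in this abstract) is a simple voxel test on docked poses:

```python
import numpy as np

def outside_fraction(pose_coords, active_voxels, spacing=1.0):
    """pose_coords: (n_atoms, 3) array; active_voxels: set of (i, j, k)
    integer grid indices covering the active volume."""
    idx = np.floor(np.asarray(pose_coords) / spacing).astype(int)
    outside = sum(tuple(v) not in active_voxels for v in idx)
    return outside / len(idx)

def is_probable_false_positive(pose_coords, active_voxels, tol=0.0):
    # any use of space beyond the active volume flags the predicted hit
    return outside_fraction(pose_coords, active_voxels) > tol
```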


Hit Explosion with a Novel Graph Matching Algorithm

Nicola Richmond
GlaxoSmithKline, Gunnels Wood Road, Stevenage, Herts, SG1 2NY, UK

For large pharmaceutical companies, high throughput screening (HTS) is often a significant component of the lead generation process. Due to capacity constraints, together with the size and diversity of corporate collections, it is not always possible to progress a compound at every stage of the HTS cascade. Thus, following a successful HTS campaign, there is a greater need for fast, efficient algorithms with which to perform hit explosion.

We present a fast, novel graph matching algorithm, based on the comparison of distance degree sequences, that can be applied to a number of data-mining problems. The algorithm matches pairs of nodes, one from each graph, by solving the linear assignment problem. The graph similarity is then given by the minimum cost associated with the optimal set of matching node pairs.

In general, a similarity cost of zero indicates that the two graphs are isomorphic. However, this is not exclusively the case. We describe the algorithm, discuss examples where this characterisation of graph isomorphism fails, and indicate how the algorithm may easily be adapted to solve problems encountered in chemoinformatics.
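
A sketch of one natural reading of the algorithm, using networkx and scipy (details such as the handling of unequal graph sizes are simplified here):

```python
import networkx as nx
import numpy as np
from scipy.optimize import linear_sum_assignment

def distance_degree_sequences(g, max_dist):
    """One histogram per node: how many nodes sit at shortest-path
    distance 1, 2, ..., max_dist from it."""
    seqs = []
    for node in g:
        lengths = nx.single_source_shortest_path_length(g, node)
        hist = np.zeros(max_dist)
        for d in lengths.values():
            if d >= 1:
                hist[d - 1] += 1
        seqs.append(hist)
    return seqs

def graph_match_cost(g1, g2):
    m = max(g1.number_of_nodes(), g2.number_of_nodes())  # safe distance bound
    s1 = distance_degree_sequences(g1, m)
    s2 = distance_degree_sequences(g2, m)
    cost = np.zeros((m, m))      # dummy rows/cols stay at zero cost here; a
    for i, a in enumerate(s1):   # real implementation would penalize size mismatch
        for j, b in enumerate(s2):
            cost[i, j] = np.abs(a - b).sum()    # L1 mismatch of the sequences
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()  # 0 for isomorphic graphs (not conversely)
```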

In particular, by representing molecules as 2D topological pharmacophores, we adapt the algorithm to rank a corporate collection against a query molecule of interest, and to cluster the ranked list into groups of compounds that have identical chemical graphs. We then analyse the list to demonstrate that the highest ranked compounds correspond to the analogues of the query, followed by families of lead hops, whilst highlighting the visualisation advantages gained from the clustering component.

Finally, we discuss how this purely unsupervised approach can follow our highly automated high throughput screening process to recover not only families of compounds on which to build structure-activity relationships, but also missed hits that, due to capacity constraints and quality issues, may have fallen out of the HTS cascade.


Identifying Maximally Enriched Scaffolds in HTS Data Sets

Martin Packer
AstraZeneca, Alderley Park, Cheshire, SK10 4TG, UK

Analysis of high throughput screening (HTS) data usually involves substructure analysis, to generate chemical start points for lead identification. A goal of substructure analysis is to identify highly enriched scaffolds, which show an enhanced level of activity relative to the overall hit rate for the assay. Enriched scaffolds can be identified by clustering on substructure and then extracting the maximal common substructure for each cluster. However, if clustering is performed without reference to the assay data, the resulting scaffolds are unlikely to show optimal enrichment for the assay in question. We have developed a method for locating scaffolds with high enrichment factors, utilising a hierarchical search strategy. The method can be applied to large HTS data sets (> 0.5 million compounds). The hierarchical nature of the search means that structure-activity relationships emerge for the most enriched scaffolds; this is of value in prioritising targets for chemical synthesis.
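
For reference, the quantity being optimized is, under the usual definition, a scaffold's hit rate relative to the hit rate of the whole screen:

```python
def enrichment_factor(n_active_with_scaffold, n_with_scaffold,
                      n_active_total, n_total):
    overall_rate = n_active_total / n_total
    scaffold_rate = n_active_with_scaffold / n_with_scaffold
    return scaffold_rate / overall_rate

# e.g. 40 of 200 scaffold members active vs. 5000 of 500000 overall:
# EF = (40/200) / (5000/500000) = 0.2 / 0.01 = 20
```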


When Worlds Collide: Looking for Answers to "The Alignment Problem"

Robert D. Clark
Informatics Research Center, Tripos, Inc., St. Louis MO 63144, USA

Structure-based and ligand-based drug design often seem to represent mutually exclusive world views, with docking techniques preferred whenever a crystal structure is available. When no structure is available, researchers must fall back on pharmacophore matching or comparative molecular field analysis (CoMFA). Unfortunately, ligand binding often induces structural changes that significantly reduce the usefulness of apoprotein structures for docking and scoring. In such cases it is often better to dock into the binding site of a ligand-protein complex from which the ligand has been extracted in silico. Even when a naïve protein structure is suitable for docking, ligands can provide critical information about the location of the relevant binding site. Moreover, interactions with specific binding site residues illuminated by bound ligands have been successfully used to direct docking and to tailor scoring functions to specific target proteins. An extreme version of this is the use of docking to align molecules for CoMFA, which stands in contrast to "classical" substructural alignment after energy minimization. These alternatives will be contrasted with each other, with pharmacophoric alignment, and with a hybrid, ligand-centric approach wherein conformations are generated by docking but the frame of reference for the field analysis is based on shared substructures.


Charting the Chemical Space: a Rule-based, Hierarchical Method for the Classification of Chemical Structures by Scaffold

Ansgar Schuffenhauer1, Peter Ertl1, S. Roggo1, Stefan Wetzel2 and H. Waldmann2
1Novartis Institutes for BioMedical Research, CH-4002 Basel, Switzerland
2Max-Planck-Institute for Molecular Physiology, Otto-Hahn-Straße 11, D-44227 Dortmund and University of Dortmund, Germany

The increasing size of screening libraries and other chemical databases poses a challenge for data analysis and visualization techniques. While descriptor-based clustering, principal component analysis and neural networks have all been used successfully, the classes created by these methods are often not intuitive in the view of a chemist and do not share a meaningful common scaffold. Classification by molecular frameworks [1], obtained by removing all acyclic side-chains from a molecule, produces a flat classification in which the addition of any cyclic substituent to a structure changes its class membership; different members of the same combinatorial library, created by decoration of a common scaffold, may therefore end up in different classes. However, molecular frameworks are a good starting point for a hierarchical, rule-based partition oriented on scaffolds.

Here we present a hierarchical classification method for chemical structures, using molecular frameworks as the leaf nodes of a scaffold tree [2,3]. By iterative removal of rings from the molecular framework, the scaffolds forming the higher levels of the hierarchy tree are obtained. Removal of a ring in this context means removing from the structure all atoms and bonds that belong to the ring being removed and to no other ring. If the removal of a ring would lead to a disconnected structure, that ring cannot be removed. All possible parent scaffolds resulting from the removal of one of the rings are prioritized by a set of rules, applied in a fixed order of precedence, to ensure that characteristic, non-trivial parts of the scaffold are retained and that the scaffold remains intuitive from a medicinal and synthetic chemistry point of view.
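
A minimal sketch of a single pruning step, assuming RDKit and deliberately ignoring the full published rule set of ref. 2, might look as follows:

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def candidate_parents(framework):
    """Yield parent scaffolds obtained by deleting one peripheral ring."""
    ring_atom_sets = [set(r) for r in framework.GetRingInfo().AtomRings()]
    for i, ring in enumerate(ring_atom_sets):
        others = set().union(*(r for j, r in enumerate(ring_atom_sets) if j != i))
        removable = ring - others                # atoms in this ring and no other
        if not removable:
            continue                             # fully fused ring; not removable
        em = Chem.RWMol(framework)
        Chem.Kekulize(em, clearAromaticFlags=True)
        for idx in sorted(removable, reverse=True):
            em.RemoveAtom(idx)
        parent = em.GetMol()
        if len(Chem.GetMolFrags(parent)) != 1:
            continue                             # removal would disconnect the scaffold
        try:
            Chem.SanitizeMol(parent)
        except Exception:
            continue                             # chemically invalid remainder; skip
        yield MurckoScaffold.GetScaffoldForMol(parent)  # strip leftover side chains
```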

Since the classification process is done individually for each molecule, the computational effort to classify a dataset scales linearly with the number of structures.

The application of the classification is demonstrated on the pyruvate kinase dataset extracted from the PubChem database. Highlighting by color intensity is used to show the fraction of active compounds containing each scaffold. This way of visualizing the scaffold hierarchy is very intuitive, because the color intensity coding immediately identifies those branches of the scaffold tree which contain bioactive molecules.

The method can also be used to “reverse engineer” enumerated compound collections, such as those offered by various companies selling compound libraries for screening, and to identify the combinatorial library scaffolds which have most likely been used. Because the classification is dataset-independent, it is possible to overlay the scaffold trees of different datasets and to identify scaffolds present in one set but missing in the other.

  1. G. W. Bemis, M. A. Murcko. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem., 1996, 39, 2887.
  2. Schuffenhauer A., Ertl P., Roggo S., Wetzel S., Koch M.A., Waldmann H. The Scaffold Tree - Visualization of the Scaffold Universe by Hierarchical Scaffold Classification. J. Chem. Inf. Model. 2007, 47, 47.
  3. Koch, M.A., Schuffenhauer, A., Scheck, M., Wetzel, S., Casaulta, M., Odermatt, A., Ertl, P, Waldmann, H., Proc. Natl. Acad. Sci. USA 2005, 102, 17272.

A Spectral Clustering Approach for the Analysis of Screening Data

Mark Brewer
Evotec (UK) Ltd., 111 Milton Park, Abingdon, Oxfordshire, OX14 4RZ, UK

Evotec has recently developed a novel spectral clustering method for molecular datasets[1]. High throughput in vitro and in silico screening campaigns frequently result in shortlists of compounds where cluster-based cheminformatics investigation can play an important role. For example, the primary hits from an in vitro screen can be clustered to elucidate early stage structure-activity relationships, identify compounds for hit confirmation and prioritise series for medicinal chemistry, while in silico hits can be clustered to ensure diversity among compounds that are earmarked for purchase and/or synthesis. The spectral clustering method has been found particularly useful in the analysis of these compound shortlists. The method provides a natural means to quantify the degree of similarity within a molecular cluster and also the contribution that a molecule makes to a cluster. These two criteria can be used to arrange molecules into clusters of chemically related molecules and quantify inter-cluster relationships so that the resultant classification scheme appears intuitive from a medicinal chemistry perspective. Details of the spectral clustering method will be presented and discussed along with illustrative examples and recent results.
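
A generic spectral clustering sketch (not Evotec's specific method; a Tanimoto matrix from any fingerprint could supply the affinity) illustrates the machinery involved:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_cluster(affinity, n_clusters):
    """affinity: symmetric (n, n) similarity matrix, e.g. pairwise Tanimoto."""
    d = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    m = d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]  # D^-1/2 A D^-1/2
    vals, vecs = eigh(m)                        # eigenvalues in ascending order
    top = vecs[:, -n_clusters:]                 # leading eigenvectors
    rows = top / np.linalg.norm(top, axis=1, keepdims=True)   # row-normalize
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(rows)
    return labels, vals[-n_clusters:]           # eigenvalues hint at cluster coherence
```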

  1. Mark L Brewer, Development of a spectral clustering method for the analysis of molecular datasets, J. Chem. Inf. Model. (submitted)

Modeling Chemical Reactions for Drug Discovery

Johann Gasteiger
Computer-Chemie-Centrum, Universität Erlangen-Nürnberg D-91052 Erlangen, Germany

Chemical reactions play a major role at many steps of the drug discovery process. A better understanding and modelling of chemical reactions could greatly increase the efficiency in developing a new drug.

In target identification, an understanding of enzyme reactions is needed. In lead discovery and lead optimization, an estimate of synthetic accessibility is desired, syntheses have to be designed, and the synthesis of a library requires knowledge of the scope and limitations of a reaction type. Furthermore, knowledge of the stability of the compounds in a library is necessary.

The estimation of ADME-Tox properties requires modelling the metabolism of drugs and predicting pKa values, both of which involve chemical reactions. Furthermore, many toxic modes of action are the result of chemical reactions.

Examples of modelling these various types of chemical reactions will be given.


SyGMa: Systematic Generation of Metabolites

Markus Wagener and Lars Ridder
NV Organon, PO Box 20, 5340BH Oss, The Netherlands

The goal of early drug discovery is not only to identify potent and selective lead compounds, but also to ascertain that these lead compounds have a favorable pharmacokinetic profile. In recent years, much work has been reported on the in silico modeling of so-called ADME (absorption, distribution, metabolism and elimination) properties. While these efforts have led to numerous models describing the absorption of drug-like compounds, much less attention has been paid to the prediction of metabolism.[1]

Here, we present SyGMa (Systematic Generation of Metabolites), a novel rule-based method that predicts potential metabolites of a given parent structure. The method is based on reaction rules derived from metabolic reactions reported in the Metabolite Database [2] to occur in man. The predicted metabolites are ranked according to an empirical probability score. Evaluation of the method demonstrated a significant enrichment of true metabolites at the top of the ranking list.
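
The rule mechanism can be illustrated with RDKit reaction SMARTS; the two rules and their scores below are invented for illustration and are not SyGMa's actual rule set:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

RULES = [
    # (name, reaction SMARTS, empirical probability score) -- all invented
    ("aromatic hydroxylation", "[cH:1]>>[c:1]O", 0.25),
    ("N-demethylation", "[N:1][CH3]>>[N:1]", 0.30),
]

def predict_metabolites(parent_smiles, rules=RULES):
    parent = Chem.MolFromSmiles(parent_smiles)
    ranked = {}
    for name, smarts, score in rules:
        rxn = AllChem.ReactionFromSmarts(smarts)
        for products in rxn.RunReactants((parent,)):
            mol = products[0]
            try:
                Chem.SanitizeMol(mol)           # reaction products start unsanitized
            except Exception:
                continue
            smi = Chem.MolToSmiles(mol)
            ranked[smi] = max(ranked.get(smi, 0.0), score)
    # multi-step metabolites would multiply the scores of successive rules
    return sorted(ranked.items(), key=lambda kv: -kv[1])

print(predict_metabolites("CN(C)CCc1ccccc1"))   # toy parent compound
```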

The current rule set of SyGMa covers ca. 70% of the human in vivo data in the Metabolite Database. To better understand which metabolic reactions are and are not reproduced by SyGMa, a similarity analysis of the reaction types present in the database was performed.

Predictions of SyGMa are used at Organon to better plan experiments aimed at experimental metabolite identification and to suggest labile sites amenable for optimization by medicinal chemistry. Examples to illustrate these applications will be given.

  1. van de Waterbeemd H., Gifford E., Nat. Rev. Drug. Discov. 2003, 2, 192-204.
  2. The Metabolite Database is available from Elsevier MDL (www.mdli.com).

Probabilistic Approaches to Classifying Metabolic Stability of Early Drug Discovery Compounds

Anton Schwaighofer1, Sebastian Mika2, Timon Schroeter3, Antonius ter Laak4, Philip Lienau4, Andreas Reichel4, Nikolaus Heinrich4 and Klaus-Robert Müller1,3
1Fraunhofer FIRST, Intelligent Data Analysis Group (IDA), Kekuléstraße 7, 12489 Berlin, Germany
2idalab GmbH, Sophienstrasse 24, 10178 Berlin, Germany
3Technische Universität Berlin, Department of Computer Science, Franklinstraße 28/29, 10587 Berlin, Germany
4Research Laboratories of Bayer Schering Pharma, Müllerstraße 178, 13342 Berlin, Germany

In drug development, sufficient metabolic stability of a drug candidate is one of the prerequisites for overcoming the obstacles to oral bioavailability. Optimally, it should be taken into account early in the drug design process. Yet general-purpose predictive tools for this endpoint are inherently difficult to obtain.

In a joint project of Fraunhofer FIRST, idalab GmbH and Bayer Schering Pharma (BSP), we developed machine learning tools to predict the metabolic stability of compounds from the drug discovery projects at BSP. We consider experimental data of metabolic stability (given as percentage of compound remaining after incubation with liver microsomes for 30 minutes) for four different in vitro assays (human, male mouse, female mouse, male rat), with 1000 to 2100 data points per assay.

Our contribution describes the process of data analysis and model building. In particular, we describe results obtained with a variety of machine learning approaches. We compare the different approaches with respect to the performance obtained, the difficulty of the model selection procedures, interpretability, and how the “domain of applicability” can be checked.

To check the domain of applicability, we consider different strategies. For methods such as Support Vector Machines, post-processing heuristics can be used. Strategies based on the distance to the nearest neighbors can be applied generally, but we found that these are not very reliable. We also consider fully Bayesian methods that treat the “domain of applicability” implicitly (a Bayesian classification method will output a probability of 50% for a compound being stable when the compound is outside the model's domain of applicability).

From our investigations, we concluded that Bayesian classification methods show a number of beneficial properties: the effort for model selection is minimal (numeric model selection can be used to choose all relevant parameters), which also allows for fully automatic re-training. Furthermore, the probabilistic output is easy to interpret and showed almost ideal properties. Competing methods achieve similar performance, but need more careful tuning by an expert user.
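
The behaviour described above can be illustrated with a standard Bayesian classifier (a Gaussian process classifier on toy data here; the actual models used in the project are not specified): far from the training data the predictive probability relaxes towards 0.5, which itself flags inapplicability.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))           # descriptor vectors (toy)
y_train = (X_train[:, 0] > 0).astype(int)     # 1 = metabolically stable (toy label)

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(1.0)).fit(X_train, y_train)

x_inside = np.full((1, 5), 0.5)               # near the training data
x_outside = np.full((1, 5), 25.0)             # far outside the training domain
p_in = gpc.predict_proba(x_inside)[0, 1]
p_out = gpc.predict_proba(x_outside)[0, 1]    # expected to be close to 0.5
print(f"inside: {p_in:.2f}, outside: {p_out:.2f}")
```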

The developed models were validated on recent project data from BSP (data from 200 to 700 compounds, depending on the assay, that had not been available to the modeling team at Fraunhofer FIRST). This showed that the best models are both highly accurate and able to correctly identify their domain of applicability. The developed models are now fully integrated into the working environment at BSP. A tool for fully automatic, regular re-training of the models is currently being implemented.


Fig. 1: ROC curves for validation of the developed models on data from recent projects not used for model building. Left: a fully Bayesian classification model predicting metabolic stability for female mouse; 258 unambiguous validation measurements, 156 of which are in the domain of applicability. Right: model for male rat; 183 unambiguous validation measurements, 89 of which are in the domain of applicability.

Automatic Model Building Process

Joelle Gola, Olga Obrezanova and Matthew Segall
BioFocus DPI, 127 Cambridge Science Park, Milton Road, Cambridge CB4 0GD, UK

The use of in silico models to predict ADMET properties is now widely accepted as part of the drug discovery process. A number of off-the-shelf models are widely available and it is very common for a medicinal chemist to make use of these models, alongside their own proprietary models, to identify the potential ADMET hurdles they might face for their different chemical series.

However, in silico models have limitations, and their predictive ability is often reported as their major drawback. Usually this limitation is a result of the range of data available when the model was built. Typically, off-the-shelf models will be “global” models. “Global” models are built using as diverse a set of molecules as can be found, so that they can be applied to a large range of chemistries. As such, it is likely that “global” models will pick up long-range trends in properties between diverse molecules. However, this also means that they are unlikely to account for smaller differences in properties between molecules within the same chemical series.

In order to study and predict smaller property differences across a set of structurally similar molecules it is necessary to build “local” models. A “local” model will be built using a set of structurally similar molecules and therefore be more predictive for further similar molecules. The rapid design-test-redesign cycles in drug discovery mean that for a chemical series the data available for model building will be growing steadily and therefore local models will need to be continuously updated.

Model building has traditionally been a lengthy process of data curation, data set selection and finally testing different mathematical approaches to building the model. This is often at odds with the need to rapidly build and rebuild models as a chemical series evolves.

We present, through case studies, our new automatic techniques for model building, which enable non-computational scientists to capture and share the knowledge contained in their experimental data, for example by providing guidance as to why molecules have certain predicted properties and how these molecules could be redesigned to improve the property values.


The Discovery Bus: a Novel Software System for Automating QSAR Modelling

Damjan Krstajic1, John Cartmell2 and David E. Leahy3
1Research Centre for Cheminformatics, Belgrade, Serbia
2Cyprotex Discovery Limited, 13-15 Beech Lane, Macclesfield, Cheshire, SK10 2DR, UK
3University of Newcastle, Newcastle-upon-Tyne, UK

There are many different methods and approaches for solving QSAR problems. Regardless of our preference for a particular set of techniques, practice teaches us that there is no single bullet-proof way to solve a QSAR problem. On the other hand, in the age of HTS and highly automated laboratories, there is a bombardment of new data and experimental results. The challenge is to design a system that copes with this constant influx of new information and automates QSAR modelling without sacrificing the quality of predictions.

The Discovery Bus is our solution to this challenge. It is an implementation of Competitive Workflow, a novel software architecture realized using autonomous software agents. It takes advantage of both workflow and multi-agent systems and has significant additional benefits. In particular, all possible combinations of components are explored, leading to an exhaustive evaluation of potential solutions. Our starting premise was that we cannot know in advance which technique or approach will solve a given QSAR problem. However, if we judiciously apply most of the well-known techniques and approaches to a QSAR problem, we will face an explosion in the number of models, but we will also end up with multiple good solutions among them. The ability to create and test a huge number of models in a matter of hours gives us the opportunity to easily test new approaches and our variations on them.
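
A condensed sketch of the 'explore all combinations' idea (not the actual Discovery Bus agent code; the descriptor and learner choices below are placeholders):

```python
from itertools import product
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

def competitive_models(descriptor_sets, y, top_n=5):
    """descriptor_sets: {name: descriptor matrix X}; y: measured property."""
    learners = {
        "ridge": lambda: Ridge(alpha=1.0),
        "forest": lambda: RandomForestRegressor(n_estimators=100),
        "knn": lambda: KNeighborsRegressor(n_neighbors=5),
    }
    results = []
    for (d_name, X), (l_name, make) in product(descriptor_sets.items(),
                                               learners.items()):
        # every descriptor set is paired with every learner and validated
        q2 = cross_val_score(make(), X, y, cv=5, scoring="r2").mean()
        results.append((q2, d_name, l_name))
    results.sort(reverse=True)          # multiple good models survive
    return results[:top_n]
```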

During this presentation we will describe the Discovery Bus in more detail, show the results of models created on the Discovery Bus using previously published datasets, and explain certain non-standard approaches that have performed well in practice.

  1. Cartmell J., Enoch S., Krstajic D., Leahy D.E., Automated QSPR through Competitive Workflow. Journal of Computer-Aided Molecular Design 2005, 19, 821-833.

Fragment-based De Novo Design

Christian Lemmen, Marcus Gastreich, and Holger Claussen
BioSolveIT, An der Ziegelei 75, 53757 St.Augustin, Germany

We have designed and implemented a suite of software tools for fragment-based de novo design. Fragment libraries can be generated using CoLibri, either by assembling collections of combinatorial chemistries or by shredding compounds according to the rules of retro-synthetic analysis. Fragment libraries can be browsed, enumerated, filtered, selected and searched using either a ligand-based or a structure-based approach.
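
CoLibri's shredding rules themselves are not public; as an analogous, openly available approach, RDKit's BRICS rules fragment a compound at retrosynthetically plausible bonds:

```python
from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")   # paracetamol, as an example input
fragments = sorted(BRICS.BRICSDecompose(mol))
print(fragments)   # fragments carry numbered attachment points such as [1*], [16*]
# recombining such fragments (e.g. with BRICS.BRICSBuild) enumerates a
# combinatorial space of virtual compounds
```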

Using the ligand-based approach, a query molecule is taken and, on the basis of the fragment libraries, virtual compounds are generated on-the-fly using the similarity principle as implemented in the FTrees descriptor. In this way, molecules can be found in vast combinatorial spaces that exhibit physico-chemical properties in topologically similar arrangements to those of the query molecule.

FlexNovo follows the structure-based approach. First, all fragments are exhaustively docked to find suitable low-affinity anchors with the potential to be expanded into lead compounds. Following an assessment step, the highest-scoring candidates are grown within the context of the active site in successive expansion cycles.

CoLibri, FTrees-FS and FlexNovo have been used to design potential inhibitors for different targets of pharmaceutical interest. In all cases, chemically reasonable structures were generated that share structural motifs with known active structures.


SPROUT LeadOpt: Protein Structure Based Lead Optimisation

A. Peter Johnson, Krisztina Boda, Philip Bone, Shane Weaver, Aniko Valko and Vilmos Valko
School of Chemistry, University of Leeds, Leeds LS2 9JT, UK

De novo design is an important computational tool in protein structure based drug design, in which structures are built up by linking together small 3D fragments within the binding cavity of the protein target. True de novo design systems, like SPROUT, can generate many structurally diverse ligands predicted to bind tightly to a target protein. The synthetic accessibility of these hypothetical ligands is an important factor in choosing between alternative hits, and a number of different methods have been developed to address this problem. Lead optimisation is a different, though no less important, problem in structure-based drug design, and a variant of SPROUT has been developed to address it. SPROUT LeadOpt employs both retrosynthetic analysis and knowledge of available starting materials to suggest analogues of leads with improved predicted binding affinity. It comprises two modules, Core Extension and Monomer Replacement, which provide alternative modes of lead optimisation. The way in which these systems operate will be presented, along with experimental results relating to the application of these systems to the design and synthesis of specific enzyme inhibitors.


Maximum Common Binding Modes: a Novel Consensus Scoring Concept using Multiple Ligand Information

Steffen Renner1,2, Swetlana Derksen2,3, Sebastian Radestock2,3, Tanja Weil2
1Max-Planck-Institut für molekulare Physiologie, Dortmund, Germany
2Merz Pharmaceuticals, Frankfurt am Main, Germany
3Johann Wolfgang Goethe-Universität, Frankfurt am Main, Germany

Improving the scoring functions for small molecule-protein docking is a highly challenging task in current computational chemistry. Recent advances have addressed the integration of additional available information, as found, for example, in consensus scoring, or by using the similarity of the binding pattern of docked molecules to the patterns of co-crystallized reference ligands, as realized in the concept of structural interaction fingerprints (SIFt) [1]. In ligand-based virtual screening, the use of multiple active reference structures is based on the same concept [2,3].

Here, we present a novel consensus approach for predicting the binding modes of multiple known active ligands whose binding modes are unknown. The premise of our approach is that structurally similar ligands of the same receptor are likely to bind in a similar manner; a recent study confirmed that in most cases similar ligands occupy similar sites at the receptor [4]. We used the size of the common (consensus) pattern of interactions between ligand and receptor, found across the set of similar ligands, to score docking poses. Patterns of interactions were modeled using ligand-receptor interaction fingerprints.
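
A minimal sketch of this consensus criterion, with invented names and exhaustive enumeration over pose combinations (a real implementation would need a smarter search):

```python
from itertools import product

def consensus_size(pose_combo):
    """pose_combo: one interaction set per ligand."""
    common = pose_combo[0].intersection(*pose_combo[1:])
    return len(common)

def best_consensus(poses_per_ligand):
    """poses_per_ligand: for each ligand, a list of candidate poses, each a
    frozenset of interaction labels such as 'ASP189:hbond'."""
    best = max(product(*poses_per_ligand), key=consensus_size)
    return best, consensus_size(best)
```

The selected combination assigns one pose per ligand; its consensus size then replaces, or complements, the raw docking score.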

We evaluated our approach using four datasets of known co-crystal structures (thrombin, CDK-2, dihydrofolate reductase and HIV-1 protease). Docking poses were generated with FlexX and rescored using our approach. For comparison, we also rescored the docking solutions using the scoring functions offered by the CScore module of Sybyl (Dock Score, Gold Score, ChemScore and PMF Score) and by consensus scores thereof. In three of the four test sets, our approach outperformed the other methods and identified poses for all ligands within an RMSD threshold of 2.5 Å.

  1. Deng, Z., Chuaqui, C., Singh, J., J. Med. Chem. 2004, 47, 337-344.
  2. Hert, J., Willett, P., Wilton, D.J., Acklin, P., Azzaoui, K., Jacoby, E., Schuffenhauer, A., J. Chem. Inf. Comput. Sci. 2004, 44, 1177-1185.
  3. Renner, S., Schneider, G., J. Med. Chem. 2004, 47, 4652-4664.
  4. Boström, J., Hogner, A., Schmitt, S., J. Med. Chem. 2006, 49, 6716-6725

Chemoinformatics in ABCD: Where Substance Meets Style

Dimitris K. Agrafiotis
Johnson & Johnson Pharmaceutical Research & Development, LLC, 665 Stockton Drive, Exton, PA 19341, USA

This talk will highlight recent developments in ABCD, an integrated drug discovery informatics platform developed at J&JPRD. ABCD represents a new vision for discovery informatics that goes far beyond the loose "plumbing" of data systems and applications that one typically encounters in the pharmaceutical industry. The system consists of three major components: 1) a data warehouse, which combines data from multiple chemical and pharmacological transactional databases and is designed for supreme query performance; 2) a state-of-the-art application suite, which facilitates data upload, retrieval, mining and reporting; and 3) a workspace, which facilitates collaboration and data sharing by allowing users to share queries, templates, results and reports across project teams, campuses and other organizational units. Chemical intelligence, performance and analytical sophistication lie at the heart of the new system, which was developed entirely in-house. ABCD is used routinely by more than 850 scientists around the world, and is rapidly expanding into other functional areas within J&J. Following the successful release of the data warehouse, we have now embarked on the development of a new global transactional system that will replace the legacy operational data stores. This presents us with several compelling advantages: a common ontology used across the transactional and decision-support layers; a simpler, more streamlined and more robust ETL subsystem; and a radically different end-user experience through the use of a single, unified application front-end. Indeed, ABCD is the only system of its kind that will utilize a common framework for the entire discovery data life-cycle, including processing, upload, mining, analysis, visualization and reporting. Our approach departs from "conventional wisdom" in several important respects: 1) it does not rely on the web as a delivery medium; 2) it does not use federation for data integration; 3) it does not use a 3-tier architecture; 4) it does not necessitate multiple applications to manage and interact with the data; and 5) it does not follow the common practice of purchasing or outsourcing the development of core enterprise systems for something as complex and critical as drug discovery.


Exploring Virtual Compound Space with Bayesian Statistics

Willem van Hoorn
Pfizer, Ramsgate Road, Sandwich, Kent, CT13 9NJ, UK

The Pfizer Virtual Library (PVL) consists of all compounds that could be made from available monomers and validated library chemistry protocols. It is currently estimated to contain on the order of billions of compounds. In the process of drug discovery it is often desirable to make multiple analogues around a given compound, and library chemistry is an attractive way to do that. However, the PVL is too large to search exhaustively by brute force, and contains too many protocols for any human to have a complete overview. How can the most appropriate protocol be found? Bayesian learning has been applied to build a model that encapsulates what distinguishes compounds made by protocol X from compounds made by protocol Y. This model is used to rank the available protocols by the likelihood that compounds made by each protocol resemble the input compound. The top 16 protocols are further mined by a nearest-neighbour search of the input compound against the compounds made in the past with the highly ranked protocols. The nearest neighbours are probably not the most similar compounds that could have been synthesised, but they provide the medicinal chemist with a handle for judging the appropriateness of each protocol, and, in contrast to virtual compounds, they can be submitted to assays.
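
A hedged sketch of this two-stage idea (the details of Pfizer's model are not public; the classifier choice and names below are assumptions):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def rank_protocols(X_train, protocol_labels, x_query, top_n=16):
    """X_train: (n, bits) binary fingerprints of historical compounds;
    protocol_labels: the protocol used to make each; x_query: (bits,)."""
    clf = BernoulliNB().fit(X_train, protocol_labels)
    probs = clf.predict_proba(x_query.reshape(1, -1))[0]
    order = np.argsort(probs)[::-1][:top_n]
    return [(clf.classes_[i], probs[i]) for i in order]

def nearest_neighbours(X_protocol, x_query, k=5):
    """Tanimoto nearest neighbours among compounds made by one protocol."""
    inter = (X_protocol & x_query).sum(axis=1)
    union = (X_protocol | x_query).sum(axis=1)
    sims = inter / np.maximum(union, 1)
    return np.argsort(sims)[::-1][:k]
```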


Interpretable Correlation Descriptors and Compression-based Similarity: New Open Source Software

Jonathan D. Hirst and James L. Melville
School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, UK

Highly predictive Topological Maximum Cross Correlation (TMACC) descriptors for the derivation of quantitative structure-activity relationships (QSARs) are presented, based on the widely used autocorrelation method. They require neither the calculation of three-dimensional conformations nor an alignment of structures [1]. Open source software for generating the TMACC descriptors is freely available from our website: http://comp.chem.nottingham.ac.uk/download/TMACC. Software is also available for compression-based similarity (http://comp.chem.nottingham.ac.uk/download/zippity). The latter implements a simple and effective method for similarity searching in virtual high throughput screening, requiring only a string-based representation of the molecules (e.g. SMILES) and standard compression software, available on all modern desktop computers [2]. Both approaches have been validated on a range of datasets and benchmarked against contemporary methods.
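
For the compression-based similarity, a minimal version in the spirit of ref. 2 is the normalized compression distance over SMILES strings, with zlib standing in for any standard compressor:

```python
import zlib

def c(s):
    """Compressed size of a string, in bytes."""
    return len(zlib.compress(s.encode()))

def ncd(a, b):
    """Normalized compression distance; smaller = more similar."""
    ca, cb, cab = c(a), c(b), c(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

print(ncd("CC(=O)Oc1ccccc1C(=O)O", "OC(=O)c1ccccc1O"))  # aspirin vs. salicylic acid
```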

  1. Melville, J.L. and Hirst, J.D., TMACC: Interpretable correlation descriptors for quantitative structure-activity relationships. J. Chem. Inf. Model., in press (2007). DOI: 10.1021/ci6004178
  2. Melville, J.L., Riley, J.F. and Hirst, J.D., Similarity by compression. J. Chem. Inf. Model., 47, 25-33 (2007). DOI: 10.1021/ci600384z

A Web Service Infrastructure for Chemoinformatics

David J. Wild
Indiana University, Bloomington, IN 47401, USA

In this talk we will describe work we have done at Indiana University to create a comprehensive web service infrastructure for chemoinformatics. We will demonstrate a variety of aspects of its use, including "mash-up" applications, and text mining of web and journal article documents. We will discuss the possibilities created by applying computational infrastructure to the widest possible amount of chemical and related information, including smart search engines.