3rd Joint Sheffield Conference on Chemoinformatics

21st-23rd April, 2004

Oral Presentation Abstracts


Better Clusters Faster

John M. Barnard*, Geoff M. Downs, David J. Wild and P. Matt Wright
Barnard Chemical Information Ltd, 46 Uppergate Road, Sheffield, S6 6BX, UK

During the past ten years two clustering methods have been widely used for large chemical datasets, Ward's (hierarchical agglomerative) and K-means (non-hierarchical), which have to a considerable extent supplanted the previously popular Jarvis-Patrick (non-hierarchical) method. Studies have shown that Ward's generally produces better groupings than K-means, but its time requirements are an order of complexity greater, making it impractical, at least in serial-processing implementations, for the datasets of millions of compounds that now commonly require clustering.

In this study, we investigate a new and faster hierarchical clustering method, Divisive K-means, and a new method, Stepped Partitioning, for selecting optimal sets of clusters from a hierarchy. The efficiency of these methods, in terms of processing time and memory requirements, will be discussed along with quantitative and qualitative assessments of their effectiveness in grouping compounds into appropriate clusters. Ways in which these two methods can be used alongside existing methods to improve the clustering of very large datasets will be examined. We will also discuss the use of parallel hardware to cluster extremely large datasets.
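The abstract does not give implementation details of Divisive K-means; a rough sketch of the divisive (bisecting) scheme usually meant by that term is shown below. The use of scikit-learn, the largest-cluster splitting rule and the random "fingerprints" are illustrative assumptions, not the authors' algorithm.

```python
# Illustrative sketch of divisive (bisecting) K-means clustering.
# Assumption: descriptors are rows of a NumPy array; at each step the
# largest remaining cluster is split into two with standard K-means.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters):
    clusters = [np.arange(len(X))]          # start with one cluster holding every row
    while len(clusters) < n_clusters:
        # pick the largest cluster to split next (one possible heuristic)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

# Example: 1000 random 'fingerprints' of length 166, split into 8 clusters
X = np.random.rand(1000, 166)
for i, c in enumerate(bisecting_kmeans(X, 8)):
    print(f"cluster {i}: {len(c)} compounds")
```

Because each split only runs K-means on one subset, the whole hierarchy is built with far less work than an agglomerative method applied to the full dataset, which is the efficiency argument made above.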


Similarity Searching Employing Surface Fingerprints – Using Spatial Data for Meaningful Feature Selection

Andreas Bender*, Hamse Y. Mussa, Gurprem S. Gill and Robert C. Glen
Unilever Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK

A novel technique for similarity searching is introduced. Molecules are represented by surface point environments, which are subjected to information-gain-based feature selection. A Naïve Bayesian Classifier is then employed for compound classification.

The new method is tested by its ability to retrieve five sets of active molecules seeded in the MDL Drug Data Report (MDDR). For a variety of data sets, QSAR models based on significant surface point environment descriptors are developed.
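The surface-point descriptors themselves are not specified here; a minimal sketch of the two stages named above (information-gain feature selection followed by a Naïve Bayesian classifier) over generic binary descriptors might look as follows. All data and the 50-feature cut-off are illustrative assumptions.

```python
# Toy sketch: information-gain feature selection + naive Bayes scoring
# on generic binary descriptors (stand-ins for surface point environments).
import numpy as np

def entropy(y):
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """Gain of knowing binary feature x about class label y."""
    h = entropy(y)
    for v in (0, 1):
        mask = x == v
        if mask.any():
            h -= mask.mean() * entropy(y[mask])
    return h

def naive_bayes_score(X_train, y_train, x_query, eps=1.0):
    """Log-odds of activity for a query bit vector (Laplace-smoothed)."""
    act, inact = X_train[y_train == 1], X_train[y_train == 0]
    score = 0.0
    for j, v in enumerate(x_query):
        p_act = (np.sum(act[:, j] == v) + eps) / (len(act) + 2 * eps)
        p_inact = (np.sum(inact[:, j] == v) + eps) / (len(inact) + 2 * eps)
        score += np.log(p_act / p_inact)
    return score

# Keep only the most informative descriptors before classification
X = np.random.randint(0, 2, size=(500, 200))
y = np.random.randint(0, 2, size=500)
gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
keep = np.argsort(gains)[-50:]                 # 50 best features (arbitrary cut-off)
print(naive_bayes_score(X[:, keep], y, X[0, keep]))
```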


Exploring Binding Site Similarity with CavBase: Current Achievements and Future Challenges

Andreas Bergner
Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, UK

It is commonly assumed that proteins with similar functions show a significant degree of similarity in their binding sites. A method for detecting binding site similarities, CavBase, has been developed by Schmitt et al. (J. Mol. Biol. 2002, 323, 387-406). CavBase uses 3D descriptors, referred to as pseudo-centres, for characterising the physico-chemical properties of cavities incorporated into the protein surface. An approach based on graph theory is employed for finding similarities amongst various cavities, and the degree of similarity is estimated using a simple scoring function.
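CavBase itself is not public; the schematic sketch below only illustrates the general graph-theoretic idea, building a correspondence graph of type-compatible pseudo-centre pairs and taking its largest clique as the common spatial arrangement. The pseudo-centres and the 1.0 Å distance tolerance are invented for illustration.

```python
# Schematic correspondence-graph comparison of two cavities described as
# pseudo-centres (type, 3D coordinates).  Not CavBase code.
import itertools
import networkx as nx
import numpy as np

cavity_a = [("donor", (0, 0, 0)), ("acceptor", (3, 0, 0)), ("aromatic", (0, 4, 0))]
cavity_b = [("donor", (1, 1, 0)), ("acceptor", (4, 1, 0)), ("aromatic", (1, 5, 0))]

def correspondence_graph(a, b, tol=1.0):
    g = nx.Graph()
    # nodes: pairs of pseudo-centres of the same type, one from each cavity
    pairs = [(i, j) for i, (ta, _) in enumerate(a) for j, (tb, _) in enumerate(b) if ta == tb]
    g.add_nodes_from(pairs)
    # edges: pairs of pairs whose internal distances agree within the tolerance
    for (i1, j1), (i2, j2) in itertools.combinations(pairs, 2):
        if i1 == i2 or j1 == j2:
            continue
        d_a = np.linalg.norm(np.subtract(a[i1][1], a[i2][1]))
        d_b = np.linalg.norm(np.subtract(b[j1][1], b[j2][1]))
        if abs(d_a - d_b) <= tol:
            g.add_edge((i1, j1), (i2, j2))
    return g

g = correspondence_graph(cavity_a, cavity_b)
best = max(nx.find_cliques(g), key=len)
print("largest common pseudo-centre arrangement:", best)
```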

The CavBase methodology will be introduced briefly, followed by a presentation of recent developments, in particular the similarity scoring function. A case study analysing the similarity of ATP and PLP binding sites will be shown. Searches for similar binding sites and subpockets, using a test set taken from the Relibase+ protein structure database, will be discussed. The use of an enhanced similarity scoring function allows for improved discrimination between similar functional binding sites and unrelated binding sites.


Complexity Analysis of De Novo Designed Ligands

Krisztina Boda* and A. Peter Johnson
ICAMS, School of Chemistry, University of Leeds, UK

The de novo approach to rational drug design offers a powerful tool to suggest entirely novel potential leads in cases where other drug design methods fail.

However, programs for structure generation typically produce large numbers of putative ligands; therefore various heuristics (such as estimation of synthetic accessibility and binding affinity) have to be adopted to evaluate and prune large answer sets, with the goal of suggesting ligands with high affinity but low structural complexity. We have developed a method for complexity analysis which provides a fast and effective ranking technique for eliminating structures with infeasible molecular motifs.

The complexity analysis technique, implemented for the SPROUT de novo design system, is based on the statistical distribution of various cyclic and chain structural motifs and substitution patterns in existing drugs. The complexity score for a putative ligand is calculated by matching structural features against those of compounds in a database of drug molecules.

The novel feature of our technique that distinguishes it from other published methods is that the matching takes place at various levels of abstraction, so that complexity scores can be evaluated for partially substituted structures. This is the case in SPROUT, where molecular structures are built in a stepwise fashion utilising a small library of generic and specific building blocks.
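The SPROUT implementation is not reproduced here; a toy sketch of a frequency-based complexity penalty, in which motifs that are rare in a reference drug set cost more, might look as follows. The SMARTS motifs, the reference frequencies and the -log(frequency) scoring are invented for illustration.

```python
# Toy sketch of a frequency-based complexity penalty using RDKit.
import math
from rdkit import Chem

# hypothetical motif -> fraction of reference drugs containing it
motif_frequency = {
    "c1ccccc1": 0.45,        # benzene ring
    "C(=O)N": 0.30,          # amide
    "C1CC1": 0.04,           # cyclopropane
    "[N+](=O)[O-]": 0.05,    # nitro group
}

def complexity_score(smiles):
    mol = Chem.MolFromSmiles(smiles)
    score = 0.0
    for smarts, freq in motif_frequency.items():
        patt = Chem.MolFromSmarts(smarts)
        n = len(mol.GetSubstructMatches(patt))
        score += n * -math.log(freq)       # rare motifs contribute a larger penalty
    return score

for s in ["c1ccccc1C(=O)NC", "C1CC1[N+](=O)[O-]"]:
    print(s, round(complexity_score(s), 2))
```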

The presentation will include examples of applications of the technique to problems in structure based drug design.


The Use of Rapid 2D Design Methods Within a Design-to-Delivery Software Suite

Susan M Boyd1* and Michael Snarey2
1 Scynexis Europe Ltd, Fyfield Business and Research Park, Fyfield Road, Ongar, Essex, CM5 0GS, UK
2 Independent consultant, Eastry, Kent, UK

The improvement in 2D virtual screening techniques in recent years has furnished significant enrichment in activity prediction, both across gene families and against specific biological targets. In-house development of a rapid enumeration tool has allowed access to a large virtual pool of synthesisable compounds. Using efficient 2D methods, compound subsets from this pool can be focussed towards either general gene family activity or towards target-specific efficacy. Some examples using recursive partitioning methods and nearest-neighbours similarity searching will be described. Scynexis’ HEOS software (Hit Explorer Operating System) for product lifecycle tracking (design-to-delivery) can harness these techniques to select compounds for synthesis, or to derive novel chemistry ideas. Compounds designed using HEOS can then be tracked through synthesis, purification and analysis via this integrated platform.


Evolving Median Molecules on the Pareto Frontier

Nathan Brown1*, Ben McKay1 and Johann Gasteiger2
1 Avantium Technologies B.V., P.O. Box 2915, 1000 CX, Amsterdam, The Netherlands
2Computer-Chemie-Centrum and the Institute for Organic Chemistry, University of Erlangen-Nürnberg, Nägelsbachstraße 25, D-91052, Erlangen, Germany

The vast size of chemistry space necessitates the development of effective search and sampling algorithms that filter and rationalise the entire space to a subset of structures that is likely to be of most interest to the current research. When presented with a set of interesting molecules, it is often desirable to automatically generate additional structures that exhibit similar characteristics.

We propose a novel graph-based genetic algorithm, Compound Generator (CoG), which operates on meta-molecules rather than the molecules themselves, allowing the complexity of the molecular building blocks to be as simple or as complex as required by each project. To demonstrate the effectiveness of CoG, we apply it to the de novo evolution of sets of median molecules that lie on the Pareto frontier between the existing objective molecules, as a possible approach to filling interesting regions of chemistry space.
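CoG itself is not shown here; the sketch below only illustrates the two-objective Pareto step of the "median molecule" problem (maximise similarity to both parent structures simultaneously), using RDKit Morgan fingerprints and invented candidates as stand-ins for the evolved population.

```python
# Minimal sketch of a Pareto-frontier filter over two similarity objectives.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=1024)

parent_a, parent_b = fp("c1ccccc1O"), fp("CCN(CC)CC")
candidates = ["c1ccccc1N", "CCOCC", "c1ccccc1CCN", "CCCCCC"]

scored = [(s,
           DataStructs.TanimotoSimilarity(fp(s), parent_a),
           DataStructs.TanimotoSimilarity(fp(s), parent_b)) for s in candidates]

def pareto_front(points):
    """Keep candidates not dominated in both objectives by any other candidate."""
    front = []
    for s, a, b in points:
        dominated = any(a2 >= a and b2 >= b and (a2 > a or b2 > b) for _, a2, b2 in points)
        if not dominated:
            front.append((s, a, b))
    return front

for s, a, b in pareto_front(scored):
    print(f"{s:15s} sim(A)={a:.2f} sim(B)={b:.2f}")
```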


Application of Novel Structural Fingerprints and Bayesian Learning to HTS Data Mining and Screening Prioritization

Robert Brown* and David Rogers
SciTegic Inc., 9665 Chesapeake Drive, Suite 401, San Diego, CA 92069, USA

Analysis of high-throughput screening data is complicated by a number of real-world problems including the low hit rate, hits spanning multiple modes of activity and significant levels of noise (i.e. false positives and false negatives). We have found that the use of a Bayesian learning method in combination with an “extended connectivity” fingerprint is well able to handle such data.

We will discuss our novel fingerprinting algorithm, which can very rapidly characterize a molecule using a sparse, 4-billion-bit fingerprint. We will describe a validation study showing that the use of Bayesian learning in combination with these fingerprints is able to build predictive virtual screens, even with the limitations of the real-world data described above. We will discuss the application of these virtual screens to compound and library selection and screening prioritization.
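The SciTegic fingerprint and learner are proprietary; as a rough stand-in, the sketch below pairs RDKit Morgan fingerprints (a related circular-fingerprint method) with a Laplacian-corrected per-bit score. The radius, bit size and smoothing constant are assumptions, not the authors' settings.

```python
# Approximate illustration: circular (ECFP-like) fingerprints with a
# Laplacian-corrected naive Bayesian score summed over the set bits.
from collections import Counter
import math
from rdkit import Chem
from rdkit.Chem import AllChem

def bits(smiles, radius=2, n_bits=2048):
    fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), radius, nBits=n_bits)
    return set(fp.GetOnBits())

def train(actives, inactives, k=1.0):
    """Per-bit Laplacian-corrected log-likelihood weights."""
    n_act, n_tot = len(actives), len(actives) + len(inactives)
    base = n_act / n_tot                                   # baseline hit rate
    counts_all = Counter(b for s in actives + inactives for b in bits(s))
    counts_act = Counter(b for s in actives for b in bits(s))
    return {b: math.log((counts_act[b] + k * base) / ((counts_all[b] + k) * base))
            for b in counts_all}

def score(weights, smiles):
    return sum(weights.get(b, 0.0) for b in bits(smiles))

weights = train(["c1ccccc1C(=O)O", "c1ccccc1CC(=O)O"], ["CCCCCC", "CCOCC"])
print(score(weights, "c1ccccc1CCC(=O)O"))
```

The Laplacian correction damps the contribution of rarely seen bits, which is one reason this style of learner copes well with the noisy, low-hit-rate data described above.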


Characterization of Pharmacophore Multiplet Fingerprints as Molecular Descriptors

Robert D. Clark*, Peter C. Fox and Edmond Abrahamian
Tripos, Inc., 1699 S. Hanley Road, St. Louis, MO 63144, USA

Pharmacophore analysis has been a mainstay of drug development for many years, particularly in connection with flexible 3D searching. More recently, pharmacophore fingerprints (bitstrings) based on the presence or absence of triplets or quartets of pharmacophoric features have been used for diversity analysis and drug design. We have devised a method that creates, stores and operates on a compressed representation – bitmaps – rather than bitstrings. This approach makes storage and manipulation faster and more efficient, thereby allowing their use in a wider range of studies than was practical with earlier implementations. Our focus has been on using the pharmacophore bitmaps to measure similarity rather than diversity, providing a 3D database searching technology complementary to more classical flexible searching.

Pharmacophore multiplets are defined in terms of the feature types found at each vertex and the length of the edges separating each pair of vertices. All methods for generating pharmacophore fingerprints involve binning the edge lengths across distance ranges in some way, but it is not obvious how the bin boundaries should be set. To address this question, we have taken the data set originally compiled by Mannhold et al. (J. Pharm. Sci. 1995, 84, 1410-1419) for use in evaluating the accuracy of logP prediction across pharmacological classes, augmented it with a set of steroid antagonists (Waskowycz et al., IBM Systems J. 2001, 40, 360-376), and examined the distribution of inter-feature distances obtained in various situations. The amount of fine structure seen in such profiles varies considerably with the number of conformers considered and with the method used to generate those conformers, but the extent and positioning of "natural" clusters is surprisingly consistent across classes. A "natural" binning scheme was formulated based on these results, along with corresponding weights appropriate for generating hypotheses for use in database searching.

The similarity between molecules increased as conformer count increased, regardless of the similarity coefficient (Tanimoto, cosine, or stochastic cosine) used. This increase in similarity takes place between pharmacological classes as well as within them, however, and it turns out that discrimination peaks at unexpectedly low levels of conformational sampling - less than 100 conformers per molecule, in fact. Systematic search generally yielded the greatest discrimination for individual drug classes but was least consistent across classes, whereas random torsional sampling (directed tweak) was less discriminating but more consistent. CONFORT conformers fell between the two extremes. Bitmap hypotheses are used to build consensus maps from a set of molecules active against a particular biological target for use in database searching. Similarity to molecules within the same pharmacological class as the training set used peaked when only the most discriminating bits were included, whereas similarity to drugs in other classes increased as the number of bits included in the query increased. An asymmetric variant of the stochastic cosine similarity coefficient out-performed the cosine coefficient in this application, with peak discrimination seen when 100 bits or fewer were set in the hypothesis. Increasing the number of conformers considered generally decreased discrimination.
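For reference, the two standard coefficients named above can be computed directly on sparse sets of "on" bits, as one might store a pharmacophore-triplet bitmap; the bit sets below are invented, and the stochastic cosine variant is omitted because its definition is not given in this abstract.

```python
# Tanimoto and cosine coefficients on sparse sets of set-bit indices.
import math

def tanimoto(a, b):
    return len(a & b) / len(a | b)

def cosine(a, b):
    return len(a & b) / math.sqrt(len(a) * len(b))

mol1 = {12, 87, 403, 9001, 152_007}            # bits set by conformers of molecule 1
mol2 = {12, 87, 511, 9001, 76_314, 152_007}    # bits set by conformers of molecule 2

print("Tanimoto:", round(tanimoto(mol1, mol2), 3))
print("cosine:  ", round(cosine(mol1, mol2), 3))
```

Storing only the indices of set bits is what makes the bitmap representation described above practical: a 4-billion-position bitstring never needs to be held explicitly.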


Prediction Model Building Based on Classifying Compounds by Structural Features

Chihae Yang, Kevin Cross*, Paul Blower and Glenn Myatt
LeadScope, Inc., 1245 Kinnear Road, Columbus, Ohio 43212, USA

A new approach for building predictive models based on structural features using 2-dimensional molecular descriptors is presented. Our modeling strategy includes:
1) diagnosis of the data set: structural diversity and endpoint distribution;
2) extraction of structural rules with predictive accuracy;
3) selection of significant structural features in combination with physicochemical properties, including QSAR properties;
4) reduction of dimensionality;
5) employment of appropriate model-building algorithms;
6) evaluation of the model with chemical inference.

This paper describes a methodology where structural features of medicinal chemistry building blocks are used as the basis for the fingerprints. The importance of chemical inference is emphasized throughout the model building process. Discriminating structural features describing the compounds in local neighborhoods are used as predictors for building QSAR models. New methods for selecting features and building subsequent models will be presented. The models were derived from the inhibition of protein tyrosine phosphatase 1B (PTP 1B) by benzofuran and benzothiophene biphenyls and compared with CoMFA studies.


Finding Cancer Growth Inhibitors Using the Internet

Keith Davies
Treweren Consultants Ltd, Evesham, WR11 8LU, UK

Virtual Screening has the potential to significantly reduce the costs and timescales for Lead Generation compared with High Throughput Screening (HTS). The benefit is greatest when many millions of molecules can be screened with reliability/accuracy not significantly worse than primary HTS. Distributed computing using the Internet can harness millions of PCs to solve suitable computational problems. The THINK software has been used to screen 3.5 billion molecules for each of 12 anti-cancer targets in the Oxford University CAN-DDO project, and for further targets in the on-going Find-a-Drug project. This paper discusses the analysis of the vast numbers of hits generated by large-scale Virtual Screening and suggests an approach for identifying the best hits. Some initial results using cell-line assays will be presented that indicate higher than expected accuracy of the predictions.


A Combinatorial Docking Approach for Dealing with Protonation and Tautomer Ambiguities

Ingo Dramburg*, Jens Sadowski, H. Claussen, M. Gastreich and C. Lemmen
BioSolveIT GmbH, An der Ziegelei 75, 53757 Sankt Augustin, Germany

One major problem in structure based drug design is the lack of certainty with respect to protonation and tautomeric states. On the protein side, X-ray crystallography usually does not provide sufficient detail for the determination of protonation states or to differentiate nitrogen and carbon atoms from electron densities. As a consequence, the orientation of nitrogen atoms in the side chains of Arg, Gln and His is frequently wrong. On the ligand side a protonation state is usually chosen based on the assumption of an aqueous solution surrounding the ligand, which is questionable for basic or acidic residues in different local chemical environments. Clearly, the in-silico prediction of protein ligand complexes is highly dependent on correct proton assignments.

One major step towards solving this problem is to try a multitude of plausible configurations and determine the best overall solution by score. In order to do so, the FlexX docking suite has been augmented with a novel methodology.

On the small molecule, sites are identified which can change protonation state. Based on this information and a number of general rules, a combinatorial "pseudo library" is automatically generated for the ligand. For the assembly of the members of this library within the active site, we take advantage of the combinatorial docking approach in FlexX-C [1]. This way, all solutions for all protonation states derived from a single ligand are evaluated on-the-fly and scored all together. On the protein side, we use the ensemble module in FlexX (FlexE [2]) to automatically vary the proton assignment for selected amino acids.

Besides describing the machinery, we will provide performance figures comparing this approach to the current state of the art in docking. The performance analysis is based on benchmark datasets for cross-docking and virtual screening studies alike.

References:
1. Rarey M., Lengauer T., 'A recursive algorithm for efficient combinatorial library docking.' Perspectives in Drug Discovery and Design, 20:63-81, 2000
2. Claussen H., Buning C., Rarey M., Lengauer T., 'FlexE: Efficient Molecular Docking Considering Protein Structure Variations'


A New Paradigm for Virtual Screening

Martyn Ford1*, Tim Clark2, Graham Richards3 and Jonathan Essex4
1Centre for Molecular Design, University of Portsmouth, UK
2Computer Chemistry Centre, University of Erlangen, Germany
3Department of Chemistry, University of Oxford, UK
4School of Chemistry, University of Southampton, UK

There is a need to create a fundamentally new technology for EU industries that screen large numbers of molecules for new, improved products. Such industries include pharmaceuticals, the food industry, fine chemicals and agrochemicals. This conclusion is based on two important observations.
(i) Although quantum theory provides the most general and accurate method for calculating the properties of molecules, it has not been used routinely by industry for this purpose, even though the growth in computer processing power and the development of highly optimised software techniques now make this approach feasible.
(ii) For the past decade, the pharmaceutical industry has used High Throughput Screening (HTS), based on a combination of information technology, robotics and automation, to support its drug discovery programmes. However, HTS has so far failed to deliver the expected increase of drugs in the pipeline.

This paper will describe a new technology for virtual (in silico) screening, to provide an accurate prediction of molecular properties that can then be used to optimise the screening process. Rather than calculating properties based on unrealistic 2D or 3D atomic representations of chemical structure, it is proposed to introduce new descriptors characterised by a set of properties defined on a molecular surface. This surface will need to be defined for each accessible conformation of a molecule. The descriptors will be calculated using accurate quantum mechanical methods that will determine the observed range of molecular interactions. It is proposed that four properties are sufficient to define these interactions and that these form a set of coordinates that define a low-dimensional chemical hyperspace. This space can be investigated to obtain either local or generalised models for predicting the biological potencies of compounds.

Potency depends on the inherent biological activity, ADMET, pharmacokinetic and pharmacodynamic properties, all of which must be predicted in silico. The new technology will use the molecular electron density to define the characteristics of objects within the chemical space. This central, ground-breaking model is applicable to modelling chemical and biological systems using docking/scoring and biomolecular simulation, as well as to cheminformatics, data mining and quantitative structure-property (QSPR) and structure-activity (QSAR) models.

In the second context, it is especially suited to pattern recognition and artificial intelligence techniques and multivariate statistics, which can all be used to develop robust relationships between molecular properties and the biological responses to drugs, agrochemicals, etc.

To achieve this, existing in silico technology has been transferred from disciplines such as mathematics, information technology, process engineering and manufacturing. As a result of this basic technology, improved research and development will be possible not only for the pharmaceutical industry, but for a wide range of other industries that develop products using screening techniques.


Overlap Analysis of Compound Collections – Strategies From a Recent Acquisition

Michael F.M. Engels*, Alan Gibbs, Ed Jaeger and Dimitris Agrafiotis
Johnson and Johnson Pharmaceutical Research and Development and 3-Dimensional Pharmaceuticals, Raritan, New Jersey 08869, USA

The integration of a large chemical library into an existing collection is a complex task which affects several business critical systems within a research organization. Besides chemical registration and compound logistics, several other systems such as high-throughput screening and external compound acquisition are affected as well. A question of strategic importance is which compounds within the common library should become available for routine biological screening. This question cannot be answered effectively without in-depth knowledge of the overlap between the two libraries.  

In the context of the recent acquisition of 3-Dimensional Pharmaceuticals (3DP) by Johnson & Johnson PRD (JNJ), we conducted a series of chemoinformatics analyses aimed at comparing the contents of the JNJ and 3DP compound collections and augmenting the JNJ corporate screening deck with diverse drug-like compounds from the 3DP archive. In this talk, we describe business rules for establishing a common analysis framework, classical and novel statistics for representing the overlap between the two chemical libraries, and algorithms for selecting a subset of compounds which enhance, in some prescribed manner, the information content and quality of the JNJ high-throughput screening deck. The study was aimed at obtaining "soft" key indicators that could be used to assess the impact of the merger on several business-critical systems within PRD, and to support priority setting and resource planning for the laborious and time-consuming integration process. This report is presented in the hope that future mergers of this kind will benefit from the knowledge and experience obtained in this study.
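The business rules themselves are not disclosed here; one routine ingredient of such an overlap analysis, exact-structure overlap via canonical SMILES, can be sketched as follows (RDKit and the example structures are stand-ins, not the JNJ/3DP systems).

```python
# Exact-structure overlap between two compound collections via canonical SMILES.
from rdkit import Chem

def canonical(smiles_list):
    out = set()
    for s in smiles_list:
        mol = Chem.MolFromSmiles(s)
        if mol is not None:                      # skip unparsable registrations
            out.add(Chem.MolToSmiles(mol))       # canonical form removes notation differences
    return out

library_a = canonical(["c1ccccc1O", "CCO", "CC(=O)Oc1ccccc1C(=O)O"])
library_b = canonical(["OCC", "c1ccccc1O", "CCN"])

overlap = library_a & library_b
print(f"unique to A: {len(library_a - overlap)}, "
      f"unique to B: {len(library_b - overlap)}, shared: {len(overlap)}")
```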


Optimization of Ligand-Based Virtual Screening Protocols: Enrichments You Can Count On.

Andrew C. Good1* and Sung-Jin Cho2
1Bristol-Myers Squibb, 5 Research Parkway, Wallingford, CT 06492, USA
2Amgen Inc., One Amgen Center Drive, Thousand Oaks, CA 91320-1799, USA

Many different 2D and 3D fingerprints have been created over the years as descriptors for ligand-based virtual screening. Typically, however, only limited efforts have been made to compare their utility, often with only a single target being used in analyses against selected templates in which headline hit rates are used as quality measures. Here we describe a new twist on 3D pharmacophore descriptors, replacing simple binary fingerprints with pharmacophore contributions normalized by frequency of occurrence during conformer generation and filtered to remove descriptor “noise”. To properly validate the efficacy of these and other descriptors in virtual screening, an extensive enrichment analysis has been undertaken over 4 proteins from different target classes. This has been accomplished using a modified enrichment score designed to measure the ability to recover novel scaffolds as opposed to simple analogues. Rather than using selected probes, all molecules in a given data set are used in turn as templates. Each is assigned to a given chemotype based on the presence of key substructures, and only those molecules found during a search which represent a different chemotype are included in the enrichment score. The results obtained from these studies are discussed, as are the implications for technique validation of other computational approaches.
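The modified enrichment score is not defined in detail in this abstract; the sketch below merely illustrates the chemotype-aware counting idea, in which actives sharing the template's chemotype are ignored so that only recovered novel scaffolds contribute. The identifiers, chemotype labels and cut-off are invented.

```python
# Toy chemotype-aware enrichment: fraction of novel active chemotypes recovered
# in the top slice of a similarity-ranked list.
def scaffold_hopping_recall(ranked, template_chemotype, actives, top_fraction=0.10):
    """ranked: list of (compound_id, chemotype), ordered by decreasing similarity."""
    n_top = max(1, int(len(ranked) * top_fraction))
    novel_found = {c for cid, c in ranked[:n_top] if cid in actives and c != template_chemotype}
    novel_total = {c for cid, c in ranked if cid in actives and c != template_chemotype}
    return len(novel_found) / max(1, len(novel_total))

ranked = [("a1", "quinazoline"), ("d1", "misc"), ("a2", "pyrimidine"),
          ("d2", "misc"), ("a3", "quinazoline"), ("d3", "misc")]
print(scaffold_hopping_recall(ranked, "quinazoline",
                              actives={"a1", "a2", "a3"}, top_fraction=0.5))
```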


Design of a Compound Screening Collection for Use in High Throughput Screening

Gavin Harper*, Stephen D. Pickett and Darren V.S. Green
GlaxoSmithKline, Gunnels Wood Road, Stevenage, SG1 2NY, UK

In this paper we introduce a quantitative model that relates chemical structural similarity to biological activity, and in particular to the activity of lead series of compounds in high-throughput assays. From this model we derive the optimal screening collection make-up for a given fixed size of screening collection, and identify the conditions under which a diverse collection of compounds or a collection focusing on particular regions of chemical space are appropriate strategies. We derive from the model a diversity function that may be used to assess compounds for acquisition or libraries for combinatorial synthesis by their ability to complement an existing screening collection. The diversity function is linked directly through the model to the goal of more frequent discovery of lead series from high-throughput screening. We show how the model may also be used to derive relationships between collection size and probabilities of lead discovery in high-throughput screening, and to answer a broad class of other general and specific strategic questions.
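The form of the model is not given in the abstract; as a deliberately crude illustration of the collection-size question it can answer, the calculation below gives the probability that a randomly chosen screening subset contains at least one member of a hypothetical lead series. The library size, series size and screen size are invented.

```python
# Hypergeometric illustration: chance of hitting a lead series at least once
# when screening a random subset of a larger library.
from math import comb

def p_series_hit(library_size, series_size, screened):
    """P(at least one of `series_size` actives falls in a random screen of `screened`)."""
    return 1 - comb(library_size - series_size, screened) / comb(library_size, screened)

for screened in (50_000, 100_000, 200_000):
    print(screened, round(p_series_hit(1_000_000, series_size=50, screened=screened), 3))
```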


Application of BCUT Values to Virtual Screening

Uta Lessel
Boehringer Ingelheim Pharma KG, Department of Lead Discovery - Chemoinformatics, Ingelheim, Germany

BCUT values, developed by Pearlman and Smith at the University of Texas, are often used for diversity analysis as well as for the design of focussed combinatorial libraries. At Boehringer Ingelheim we aim to analyse the suitability of BCUTs for different virtual screening tasks. For this purpose several case studies were set up. We determined the enrichment factors that can be achieved by defining similarities in low-dimensional BCUT spaces, analysed the results and compared them to those from other similarity measures. Results of these studies will be presented. Furthermore, simple virtual screening procedures exploiting these findings will be shown.
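Pearlman's software is not reproduced here; a schematic BCUT-like calculation takes the extreme eigenvalues of a Burden-type connectivity matrix with an atomic property on the diagonal. In the sketch below the diagonal property (Gasteiger charge), the off-diagonal bond weights and the padding constant are assumptions chosen for illustration only.

```python
# Schematic BCUT-like descriptor from a Burden-type matrix.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def bcut_charge(smiles):
    mol = Chem.MolFromSmiles(smiles)
    AllChem.ComputeGasteigerCharges(mol)
    n = mol.GetNumAtoms()
    m = np.full((n, n), 0.001)                       # small constant for non-bonded pairs
    for i, atom in enumerate(mol.GetAtoms()):
        m[i, i] = atom.GetDoubleProp("_GasteigerCharge")   # atomic property on the diagonal
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        w = 0.1 * bond.GetBondTypeAsDouble()         # bonded pairs weighted by bond order
        m[i, j] = m[j, i] = w
    eig = np.linalg.eigvalsh(m)                      # matrix is symmetric
    return eig.min(), eig.max()                      # lowest/highest eigenvalues = BCUT pair

print(bcut_charge("c1ccccc1C(=O)O"))
```

Repeating this with different diagonal properties (polarisability, H-bond capability, etc.) gives the handful of axes that span the low-dimensional BCUT space used for the similarity definitions mentioned above.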


Clustering Ambiguity: An Overview

John D. MacCuish* and Norah E. MacCuish
Mesa Analytics & Computing, LLC, 212 Corona St., Santa Fe, NM 87501, USA

Binary similarity measures operating on fixed-length binary fingerprints produce a finite number of possible similarity values, dependent upon the measure and the length of the fingerprint. Clustering algorithms that operate on a finite number of possible similarity values are prone to ties in proximity and other algorithm-dependent decision ties that lead to ambiguous clustering results. An ambiguous clustering is often just one of a large number of distinct possible results that may differ widely. Examples can be obtained by performing a partitioning heuristic on the results of an algorithm after simple changes in the order of the data input. We show the different decision tie behavior and ambiguity for several common clustering algorithms: RNN hierarchical algorithms (Ward's, Complete Link and Group Average), a leader algorithm (Taylor-Butina), Jarvis-Patrick, and K-means. We show the impact on ambiguity of using different similarity measures (Euclidean, Tanimoto, Cosine) and different fingerprint lengths. We then show possible strategies to lessen ambiguity, and how to quantify it. Lastly, we show how ambiguity measures can be used in conjunction with level selection, thresholding, or other partitioning heuristics relevant to the particular clustering technique, to produce less ambiguous and more easily understood results.
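As a small illustration of where the ambiguity comes from, the snippet below counts how few distinct Tanimoto values a set of short fixed-length fingerprints can produce; ties in proximity, and hence order-dependent clustering decisions, follow directly. The fingerprint length and random data are toy values, not those used in the study.

```python
# Count distinct Tanimoto values among all pairs of short binary fingerprints.
from itertools import combinations
import random

def tanimoto(a, b):
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

random.seed(0)
fps = [[random.randint(0, 1) for _ in range(16)] for _ in range(40)]
pairs = list(combinations(fps, 2))
values = {round(tanimoto(a, b), 6) for a, b in pairs}
print(f"{len(pairs)} pairs share only {len(values)} distinct similarity values")
```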


Ligand-Based Virtual Screening Using Molecular Fields

Timothy J. Cheeseright, Andy Vinter, and Mark D. Mackey*
Cresset BioMolecular Discovery Ltd, Spirella Building, Bridge Road, Letchworth, SG6 4ET, UK

Virtual screening for new hits and leads for GPCR and ion-channel projects is a process with few success stories. The lack of X-ray data prevents the most common ‘docking’ methods from working, and most ligand-based methods fail to find completely new structural classes.

We will give details of a new method of describing ligands. Our field technology represents ligands in terms of the electron cloud that they present to the receptor. The surface and shape properties of a molecule are described as ‘field points’. Comparison of the field point patterns of different ligands that bind at the same receptor can give a putative binding conformation. Furthermore, these patterns can be encoded as a 1D vector, which when combined with a similarity metric allows molecules’ field properties to be rapidly compared. Population of a database with commercially available compounds and comparison of a known active to the database allows virtual screening using fields to be applied to lead discovery. Validation on a peptide-ligand GPCR target returned a 30% hit rate (27 of 88 compounds selected had activity better than 10 µM). The majority of hits had no structural similarity to any known inhibitor, demonstrating that this technique can find entirely novel structural motifs.


Boxing Clever with CoMFA

Jonathan D. Hirst and James L. Melville*
School of Chemistry, University of Nottingham, University Park, Nottingham, UK

Since its inception in 1988, the CoMFA approach to 3D QSAR has proved enormously popular. Despite this, CoMFA analyses can be highly sensitive to the parameters used to construct the matrix of 3D descriptors. One area of investigation, rarely mentioned in the literature, is the use of alternative methods within CoMFA to generate the steric and electrostatic fields. ‘Abrupt’, ‘smooth’ and ‘box’ methods for the calculation of electrostatic and steric field values have been assessed on three diverse datasets of medicinal chemistry interest. Significant improvements in both the quality and robustness of the statistical models can be achieved, without recourse to variable selection procedures or extra software, by computing fields with the little-used non-default ‘box’ option.


Automated Decision Support for the Screening Process

Christos A. Nicolaou1*, D.A. Kleier2, T. K. Brunck1, P.A. Bacha1
1Bioreason, Inc., 150 Washington Ave., Suite 220, Santa Fe, NM, USA
2DuPont Agricultural Products, Stine-Haskell Research Center, Newark, DE, USA

This presentation describes new, automated approaches for the extraction of Structure-Activity Relationship (SAR) rules that are both easy to interpret by humans and simple to apply in the quest to detect noise and identify potential false positives and negatives in screening datasets. The approaches, developed in association with a corporate collaborator, are based on proprietary algorithms which combine proven knowledge-based techniques from the data mining and knowledge discovery fields with the traditional human-expert-driven method. Step by step, the method analyzes a screening dataset and detects structurally homogeneous classes of chemical compounds, constructs R-Tables for each class, calculates a novel type of descriptor which is position-specific, computes a predictive model using a proprietary decision-tree algorithm and extracts concise, meaningful and statistically significant SAR rules. The resulting model, specifically the set of rules, can then be used to design and virtually screen lists of compounds to test in subsequent rounds of screening. We briefly present each step of the method, placing emphasis on the unique qualities of the Position-Specific Descriptors (PSD) and the characteristics of the novel decision-tree model and the SAR rules. Examples from a retrospective analysis of a screening dataset, including generated classes, R-Tables, SAR rules and compounds placed on secondary/confirmatory screening lists, are presented.


Automated Generation of Structural Molecular Formulae Under User-Defined Constraints

Patrick Fricker1*, Marcus Gastreich2 and Matthias Rarey3
1Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53757 Sankt Augustin, Germany (current address: University of Hamburg, Center for Bioinformatics (ZBH), Bundesstrasse 43, 20146 Hamburg, Germany)
2BioSolveIT GmbH, An der Ziegelei 75, 53754 Sankt Augustin, Germany
3University of Hamburg, Center for Bioinformatics (ZBH), Bundesstrasse 43, 20146 Hamburg, Germany

A growing number of applications dealing with large sets of molecules, from virtual screening via de novo design and library design to HTS analysis and lead optimization, need to present molecular structures to the user in a way that allows at least tens of structures to be browsed conveniently in a short time.

To obtain a fast visual impression of such a large set of molecules, 2D molecule diagrams are one of the favored depictions. In such an environment, it is desirable to influence the layout of the diagrams depending on the application. Usually, the molecules of interest show some relationship to each other: for example, in a combinatorial library they have a common core structure, in virtual screening they are related to a given query molecule, and in HTS analysis they are similar to each other, forming clusters. Here, it is necessary to draw the molecules in such a way that the relationships among them become visible to the user of the software.

We present a new algorithm for the automated creation of 2D structural formulae of molecules [1]. The algorithm is based on the classical scheme of a drawing queue, placing the molecular fragments in a sequential way. We extended the concept of prefabricated units, developed for complex ring systems, to automatically created drawing units for chains and rings, which are then assembled in a sequential fashion. In order to deal with combinatorial libraries, the drawing algorithm was modified such that the common core is uniquely oriented in all drawings.

Furthermore, we present an algorithm which enables the drawing of 2D structural formulae under directional constraints assigned to a subset of bonds. The directional constraints are applicable to different types of scenarios. It is feasible to let constraints be automatically derived by software tools for molecular design. One such automated application of the constraints, which has been implemented, employs Feature Tree [2,3] similarity matchings. Such matchings give pairwise alignments of respective pharmacophoric groups; for example, the OH group of one compound may often match the mercaptan functionality of another. Utilizing this information as a constraint for drawing leads to a directed depiction of the molecule, from which the user can more easily re-establish the similarity of a set of compounds with a common pharmacophore.

The introduced method was evaluated on the NCI cancer library for performance and coverage. For the drawing of common-core structures, the method was applied to a combinatorial library based on the Ugi reaction. The directional constraints in combination with the Feature Trees were used to draw diagrams for angiotensin-converting enzyme (ACE) inhibitors. The drawings clearly show the automatically identified similarities within the compound set, which are in agreement with a known pharmacophore for this inhibitor class. The algorithm creates drawings of small organic molecules under constraints at a rate on the order of a hundred structures per second.

References:
1. Fricker, P., Gastreich, M., and Rarey, M.: Automated Generation of Structural Molecular Formulae Under User-Defined Constraints, in preparation.
2. Rarey, M., and Dixon, J.S.: Feature Trees: A new molecular similarity measure based on tree matching, J. Comput.-Aided Mol. Design, 12, 471-490 (1998).
3. http://www.biosolveit.de/FTrees


Identification of Relevant Sub-structures in Screening Data

Stephan Reiling*, Steven Burkett and Roman Dorfman
Aventis, Drug Innovation & Approval (DI&A), Lead Generation Informatics, Bridgewater, New Jersey, USA

The most frequently used approaches to analyze screening data, either high-throughput screening or secondary assay data, are to cluster compounds by structural similarity or to use a maximum common substructure (MCS) detection method. Neither of these approaches takes into account the activity of the compounds, i.e. they will identify structural features of molecules that occur frequently, but do not necessarily distinguish between active and inactive compounds. A new method for the identification of relevant sub-structures in screening data is presented. The method combines aspects of substructure searching, MCS detection and clustering in a new way. It uses the structures and the associated activities of compounds in the data set to identify structural features that distinguish between active and inactive molecules. Originally developed for the analysis of HTS data, the method is capable of analyzing large data sets and can deal with noisy data. The underlying algorithms of the method are presented, together with examples of its application.
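The Aventis algorithm is not described in enough detail here to reproduce; the toy sketch below merely illustrates activity-aware substructure scoring, ranking candidate SMARTS by the difference in their occurrence frequencies among actives and inactives. The candidate patterns and data are invented.

```python
# Toy activity-aware substructure scoring with RDKit.
from rdkit import Chem

actives = ["c1ccccc1C(=O)N", "c1ccc(Cl)cc1C(=O)N", "c1ccccc1CC(=O)N"]
inactives = ["CCCCCC", "CCOCC", "c1ccccc1CC"]
candidates = ["C(=O)N", "c1ccccc1", "CC"]

def hit_rate(smarts, smiles_list):
    patt = Chem.MolFromSmarts(smarts)
    return sum(Chem.MolFromSmiles(s).HasSubstructMatch(patt) for s in smiles_list) / len(smiles_list)

for smarts in candidates:
    score = hit_rate(smarts, actives) - hit_rate(smarts, inactives)
    print(f"{smarts:12s} active-inactive frequency difference = {score:+.2f}")
```

A substructure that occurs frequently but equally often in both sets (like the benzene ring here) scores poorly, which is exactly the distinction from plain frequency- or MCS-based analysis made above.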


Virtual Screening of Low-Molecular Weight Compounds

Richard Taylor
Astex-Technology Ltd., 436 Cambridge Science Park, Milton Road, Cambridge, CB4 0QA, UK

Screening for low-molecular-weight hits/leads is becoming more popular since drug-sized leads are often more difficult to optimise. The use of structure-based virtual screening techniques is an attractive method to rapidly identify these lead-like compounds.

The web-based virtual screening platform developed at Astex Technology, using an in-house version of the protein-ligand docking program GOLD, will be presented. Successful application of the technology platform will be demonstrated on four targets, namely PTP1B, CDK2, neuraminidase and the estrogen receptor. Furthermore, the use of pharmacophoric restraints and consensus scoring methods to enable the prediction of binding modes and to increase enrichment factors will be presented, along with a new “Hybrid” scoring function. This function enhances binding mode prediction by combining different scoring functions into a single function, giving success rates of 90% within 2 Å RMSD for a data set of 79 fragment-like compounds.


Pharmaceutical Compound Bank Cleaning: Process

Alan Tinker* and Nick Tomkinson
Department of Physical & Metabolic Science, AstraZeneca R&D Charnwood, Bakewell Road, Loughborough, Leicestershire, LE11 5RH, UK

The popular philosophy that useful hits from HTS campaigns should have physical properties no worse than “druglike” [1] and ideally “leadlike” [2] is now widely accepted. This, together with concerns about rising HTS costs and issues with the follow-up of lists of chemically unattractive compounds post-HTS, prompted AZ Charnwood to carry out a critical review of the content of their proprietary compound bank in early 2001. From this it became clear that a significant proportion of these compounds were not remotely druglike (according to Lipinski’s criteria) and that an unacceptable number of chemically reactive or potentially toxic substances were also included in the screening set.

The screening collection was split into 3 subsets – Good, Bad and Ugly. The Good set is now routinely screened in every HTS campaign and contains compounds which are at least druglike and preferably leadlike. The Bad set is more forgiving of physical properties and is screened for specific targets where it is thought that, for example, molecular weight may need to be larger (e.g. ligands for membrane-bound receptors or inhibitors of protein/protein interactions) or where a topical rather than oral route of administration is ultimately envisaged. Ugly compounds were excluded from the screening set.
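Lipinski's criteria are cited above; a minimal rule-of-five filter of the kind implied for the "Good" set might look like the sketch below. RDKit is a stand-in, and the strict zero-violation threshold is an assumption; the actual AstraZeneca filters are not described in this abstract.

```python
# Minimal Lipinski-style druglikeness filter with RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_rule_of_five(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    violations = sum([
        Descriptors.MolWt(mol) > 500,
        Descriptors.MolLogP(mol) > 5,
        Lipinski.NumHDonors(mol) > 5,
        Lipinski.NumHAcceptors(mol) > 10,
    ])
    return violations == 0          # strict; in practice one violation is often tolerated

for s in ["CC(=O)Oc1ccccc1C(=O)O",                 # aspirin: passes
          "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"]:       # C30 alkane: fails on logP
    print(s, passes_rule_of_five(s))
```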

We are aware that other pharmaceutical companies have performed a similar exercise but, to the best of our knowledge, the actual methods used in achieving this end have not been described. In addition to a description of the philosophy and discussions behind the effort, this presentation will also focus on the practical aspects, including the costs of HTS and post-HTS processing. Current issues will be presented, including compound design and purchase: examples will be given of compounds available for purchase which would pass current filters and yet would not be considered acceptable for HTS. HTS vs non-HTS requirements will also be discussed.

References:
1. Lipinski et al. Adv. Drug Deliv. Rev. 1997, 23, 3-25.
2. Davis et al. Angew. Chem. Int. Ed. 1999, 38, 3743-3748.