2nd Joint Sheffield Conference on Chemoinformatics:
Computational Tools for Lead Discovery

9th-11th April, 2001

Oral Presentation Abstracts


Making lead discovery less complex

Michael M. Hann, Andrew R. Leach and Gavin Harper
Computational Chemistry and Informatics Unit, GlaxoSmithKline Medicines Research Centre, Gunnels Wood Rd, Stevenage, Herts, SG1 2NY, UK

Using a simple model of ligand-receptor interactions, the interactions between ligands and receptors of varying complexities are studied and the probabilities of binding calculated. It is observed that as the systems become more complex the chance of observing a useful interaction for a randomly chosen ligand falls dramatically. The implications of this for the design of combinatorial libraries are explored. A large set of drug leads and optimised compounds is profiled using several different properties relevant to molecular recognition. The changes observed for these properties during the drug optimisation phase support the hypothesis that less complex molecules are more common starting points for the discovery of drugs.
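The published model treats ligand and receptor as strings of binary recognition features, where binding requires every ligand feature to be complementary to the receptor feature it faces. A minimal sketch in that spirit (the binary feature alphabet, the receptor length of 12, and the treatment of overlapping alignments as independent trials are all simplifying assumptions of this sketch, not details taken from the talk):

```python
from math import comb

def p_useful_event(ligand_len, receptor_len):
    """Probability that a random binary-feature ligand matches a receptor
    in exactly one alignment -- i.e. it binds, and in a unique mode."""
    p_match = 0.5 ** ligand_len              # all features complementary at one alignment
    n_align = receptor_len - ligand_len + 1  # number of possible alignments
    # binomial approximation: overlapping alignments treated as independent
    return comb(n_align, 1) * p_match * (1 - p_match) ** (n_align - 1)

# past a low optimum, the chance of a useful interaction falls sharply
# as ligand complexity grows
for ligand_len in range(2, 11):
    print(ligand_len, round(p_useful_event(ligand_len, 12), 4))
```

The qualitative behaviour, not the particular numbers, is the point: simpler ligands are far more likely to give a single, detectable binding mode.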

Implications for the design of libraries and screening sets will be discussed based on these theoretical and experimental observations.


The design of leadlike combinatorial libraries.

Simon J. Teague, Andrew M. Davis, Paul D. Leeson, and Tudor Oprea
AstraZeneca R&D Charnwood, Loughborough, Leicestershire, UK

The authors propose that the properties required of library compounds intended to provide leads suitable for further optimization may be rather different from those of the drugs ultimately developed from them. The selection of leadlike compounds for further optimization eases the pressure on the subsequent, more labor-intensive parts of the drug discovery process.


Combinatorial networks: Towards the unthinkable

Dimitris K. Agrafiotis and Victor S. Lobanov
3-Dimensional Pharmaceuticals, Inc., 665 Stockton Drive, Exton, PA 19341, USA

This paper presents some recent developments on the use of machine learning techniques for the analysis and virtual screening of massive combinatorial libraries. In particular, we present a new class of multilayer perceptrons, which are trained to predict molecular properties of combinatorial products from pertinent features of their respective building blocks. This method limits the expensive enumeration and descriptor generation to only a small fraction of products and, in addition, relieves the enormous computational effort required for other common post-processing tasks, such as principal component analysis, multidimensional scaling, etc. In effect, the method provides an explicit mapping function from reagents to products, and allows the vast majority of compounds to be analyzed without constructing their connection tables. We demonstrate that this approach works well with a wide variety of molecular descriptors, and makes product-based experimental design applicable to virtual libraries of any conceivable size in time frames orders of magnitude shorter than those required by conventional methodologies.
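The central idea — predicting a product property directly from building-block features after training on only a small enumerated fraction — can be illustrated with a deliberately simple stand-in: a linear least-squares fit in place of the authors' multilayer perceptrons, and an additive toy property. The reagent counts, descriptor values and the "minus 18" condensation constant below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
desc_r1 = rng.uniform(50, 200, size=20)   # hypothetical descriptor per R1 reagent
desc_r2 = rng.uniform(50, 200, size=20)   # hypothetical descriptor per R2 reagent

def product_property(i, j):
    """Toy additive product property, e.g. MW minus a water of condensation."""
    return desc_r1[i] + desc_r2[j] - 18.0

# enumerate and 'measure' only 40 of the 400 virtual products
pairs = [(i, j) for i in range(20) for j in range(20)]
train_ids = rng.choice(len(pairs), size=40, replace=False)
X = np.array([[desc_r1[pairs[k][0]], desc_r2[pairs[k][1]], 1.0] for k in train_ids])
y = np.array([product_property(*pairs[k]) for k in train_ids])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# predict any remaining product from reagent features alone,
# without constructing its connection table
i, j = 7, 13
pred = coef @ np.array([desc_r1[i], desc_r2[j], 1.0])
print(pred, product_property(i, j))
```

Because the toy property is exactly additive in the reagent features, the linear model recovers it; the perceptrons described in the talk handle descriptors for which no such closed form exists.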


Getting past diversity in assessing virtual library designs

Robert D. Clark
Tripos, Inc., St. Louis, MO 63144, USA

A great deal of effort is currently going into the design of combinatorial libraries. Published approaches are generally "optimal" in that each best satisfies the target objective function it employs. Relative efficiencies of execution can be compared in such cases, but it is often difficult to compare libraries generated by different methods or even by different parameterizations of the same method. This is particularly true once it is appreciated that attributes other than molecular diversity are important. This paper will discuss several ways in which library designs can be meaningfully compared to one another, visually as well as numerically.


Better (combinatorial) screening through P-chem

Robert S. DeWitte
Advanced Chemistry Development, 90 Adelaide St. W., Toronto, Ontario, M5H 3V9, Canada

Expedient identification of hits from among combinatorial libraries can be aided by careful attention to the physical chemistry of compounds, mixtures and assay conditions. This talk will review applicable principles that relate physical chemistry to screening success, using Advanced Chemistry Development prediction algorithms (Solubility, LogP, pKa, LogD, Sigma) to improve compound pooling, library design (solubility, drug-likeness and diversity) and screening conditions. ACD's batch software for physical chemistry prediction will be highlighted.


Application of non-parametric regression to quantitative structure-activity relationships

Jonathan D. Hirst 1*, T. John McNeany 1, Lewis Whitehead 2 and Trevor Howe 2
1 School of Chemistry, University of Nottingham, Nottingham, NG7 2RD, UK
2 Novartis Horsham Research Centre, Wimblehurst Road, Horsham, West Sussex, RH12 4AB, UK

Several non-parametric regressors have been applied to modelling quantitative structure-activity relationship (QSAR) data. Performances were benchmarked against multilinear regression and the nonlinear method of smoothing splines. Variable selection was explored through systematic combinations of different variables and combinations of principal components. For the data examined, 539 inhibitors of a well characterised serine kinase of interest in cell signalling cascades, the best two-descriptor model had a five-fold cross-validated q-squared of 0.43, and was generated by a multi-variate Nadaraya-Watson kernel estimator. Other approaches did not perform as well. A modest increase in predictive ability can be achieved with three descriptors, but the resulting model is less easy to visualise. We conclude that non-parametric regression offers a potentially powerful approach to identifying the most predictive low-dimensional QSAR.
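For reference, the Nadaraya-Watson estimator at the heart of the best model is a kernel-weighted average of the training responses. A minimal sketch with a Gaussian kernel and synthetic two-descriptor data (the bandwidth and the data are illustrative; the kinase set used in the study is not reproduced here):

```python
import numpy as np

def nadaraya_watson(X_train, y_train, X_query, bandwidth=1.0):
    """Gaussian-kernel weighted average of training responses."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return (w @ y_train) / w.sum(axis=1)

# toy two-descriptor 'QSAR': activity is a smooth function of the descriptors
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2
pred = nadaraya_watson(X[:150], y[:150], X[150:], bandwidth=0.3)
q2 = 1 - ((y[150:] - pred) ** 2).sum() / ((y[150:] - y[:150].mean()) ** 2).sum()
print(round(q2, 2))   # high on this noise-free toy example
```

On real, noisy QSAR data the cross-validated q-squared is of course far lower, as the 0.43 reported in the abstract reflects.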


Unsupervised forward selection - a data reduction algorithm for use with very large datasets

David Whitley, Martyn Ford and David Livingstone
Centre for Molecular Design, University of Portsmouth, PO1 2DY, UK

Unsupervised Forward Selection (UFS) is a data reduction algorithm that selects from a data matrix a maximal linearly independent set of columns with a minimal amount of multiple correlation. When used in combination with a generalised regression procedure such as Continuum Regression, UFS leads to parsimonious prediction models for drug design that are both stable and accurate. The procedure has potential as a pre-processing strategy for handling the large amounts of information that are obtained from automated screening procedures prior to establishing quantitative structure-activity relationships.

UFS was designed for use in the development of Quantitative Structure-Activity Relationship (QSAR) models, where the m by n data matrix contains the values of n variables (typically molecular properties) for m objects (typically compounds). QSAR data sets often contain redundancy (exact linear dependencies between subsets of the variables), and multicollinearity (high multiple correlations between subsets of the variables). Both of these features inhibit the development of QSAR models with the ability to generalise successfully to new objects. UFS produces a reduced data set that contains no redundancy and a minimal amount of multicollinearity.

UFS is a forward stepping algorithm that proceeds as follows. First, the two columns with the smallest squared pairwise correlation coefficient are chosen. Then, for each i > 2 , column i is chosen from the remaining columns to have the smallest squared multiple correlation coefficient with columns 1 to i-1 . The process halts when the number of columns selected reaches the rank of the data matrix; that is, when the squared multiple correlation coefficient of each remaining column with those already selected equals one. Thus the algorithm builds a basis for the column space of the data matrix, minimising the multicollinearity in the reduced data set at each stage.
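The forward-stepping procedure just described can be sketched compactly. This is a simplified implementation with mean-centring and an rsqmax halting threshold, assuming non-constant columns; the production ufs program adds further options such as the minimum standard deviation filter:

```python
import numpy as np

def ufs(X, rsqmax=0.99):
    """Unsupervised Forward Selection: return indices of a (near-)linearly
    independent subset of columns with minimal multicollinearity."""
    X = X - X.mean(axis=0)                       # mean-centre (default behaviour)
    ncols = X.shape[1]
    # step 1: the pair of columns with the smallest squared correlation
    r2 = np.corrcoef(X, rowvar=False) ** 2
    np.fill_diagonal(r2, np.inf)
    i, j = np.unravel_index(np.argmin(r2), r2.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(ncols) if k not in selected]
    # step i > 2: add the column with the smallest squared multiple
    # correlation with those already selected; halt at rsqmax
    while remaining:
        multiple_r2 = []
        for k in remaining:
            beta, *_ = np.linalg.lstsq(X[:, selected], X[:, k], rcond=None)
            resid = X[:, k] - X[:, selected] @ beta
            multiple_r2.append(1.0 - (resid @ resid) / (X[:, k] @ X[:, k]))
        best = int(np.argmin(multiple_r2))
        if multiple_r2[best] >= rsqmax:
            break
        selected.append(remaining.pop(best))
    return selected

# a matrix of rank 3: column 2 is an exact linear combination of 0 and 1
rng = np.random.default_rng(0)
a, b, d = rng.normal(size=(3, 50))
X = np.column_stack([a, b, a + b, d])
sel = ufs(X)
print(sel, np.linalg.matrix_rank(X[:, sel]))   # a rank-3, dependency-free subset
```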

In practice, the ufs program excludes variables with standard deviation less than sdevmin and halts when the squared multiple correlation coefficient of each remaining column with those already selected exceeds rsqmax. Both sdevmin and rsqmax are user-adjustable parameters. The columns of the data matrix are usually mean-centred before any calculations are performed, but this behaviour may be overridden by the user.

This presentation will elaborate on the development and implementation of the algorithm and give examples of applications using 2-D and 3-D QSAR datasets. The problem of chance correlation will also be addressed.


Increasing the value of crystallographic databases for drug design

Robin Taylor, Andreas Bergner, Jason Cole, Magnus Kessler, Jie Luo, Barry Smith and Marcel Verdonk
CCDC, 12 Union Road, Cambridge CB2 1EZ, UK

A key aim of recent work at CCDC has been to increase the value of crystallographic databases for drug design. New programs include:

Relibase+ - Based on the ReLiBase program (M. Hendlich, Merck, Germany), Relibase+ allows 2D, 3D, similarity and sequence searches of protein-ligand crystal structures. Advanced features include automated superposition of similar binding sites, analysis of ligand-induced conformational changes, ligand hit-list manipulation and searches for protein-ligand nonbonded interactions.

Mogul - Mogul permits rapid retrieval of molecular geometry information from the Cambridge Structural Database (CSD). The immediate chemical environment of a bond length, valence angle or acyclic torsion angle in a user-defined molecule is algorithmically converted to a series of keys, which include information about element types, hydrogen counts, bond types, etc. Search of a tree indexed on these keys permits extremely rapid retrieval of similar bond lengths, angles or torsions from the CSD, without the need for the user to draw a search substructure. A process of key "generification" can be used to broaden the search should there be insufficient hits from the original search. A generified search essentially corresponds to a series of searches for related substructures, generated automatically by the program. The results of a generified search can be refined interactively by accepting or rejecting the contributions from individual search substructures.
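The retrieval scheme can be caricatured in a few lines: keys ordered from general to specific index every observation at every prefix depth, and "generification" is a walk back up the tree until enough hits are found. The key fields and bond-length values below are hypothetical and far simpler than Mogul's actual keys:

```python
from collections import defaultdict

index = defaultdict(list)   # maps every key prefix to observed geometry values

def add_observation(key, value):
    """Index one observed bond length under every prefix of its key."""
    for depth in range(1, len(key) + 1):
        index[key[:depth]].append(value)

def search(key, min_hits=5):
    """Exact-key search; generify (truncate the key) until enough hits."""
    for depth in range(len(key), 0, -1):
        hits = index[key[:depth]]
        if len(hits) >= min_hits:
            return depth, hits
    return 0, []

# hypothetical C-C bond lengths keyed by (elements, bond type, H counts)
add_observation(("C.C", "single", "3H.3H"), 1.53)
add_observation(("C.C", "single", "3H.2H"), 1.52)
add_observation(("C.C", "single", "2H.2H"), 1.54)
depth, hits = search(("C.C", "single", "3H.3H"), min_hits=3)
print(depth, hits)   # generified to depth 2: all C-C single bonds contribute
```

The interactive refinement described above corresponds to letting the user accept or reject the individual prefix groups that a generified search pools together.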

SuperStar - SuperStar uses a knowledge-based methodology to predict binding points in protein cavities and can therefore be used to generate pharmacophores. Recent work on the program has focussed on predicting metal coordination and generating SuperStar maps from PDB as well as CSD data.

Future work will focus on integrating these and other programs, and using the results to drive docking and crystal-structure solution packages.


Bridging bioinformatics and chemoinformatics by using a knowledge-based potential for drug design

A.W. Edith Chan1, Roman A. Laskowski2 and Janet M. Thornton2,3
1Inpharmatica, 60 Charlotte Street, London W1P 2AX, UK
2Department of Crystallography, Birkbeck College London, Malet Street, London WC1E 7HX, UK
3Biomolecular Structure and Modelling Group, University College London, Gower Street, London WC1E 6BT, UK

We have developed an automated system for generating drug design ideas for given protein targets. The system relies on an automated method of identifying and delimiting known or probable binding sites and a knowledge-based potential that describes favorable atom-atom interactions available within those sites. The potential is derived empirically from known protein and protein-complex structures and is based on 3D spatial distributions of atomic contact preferences between different atomic types. It contains no assumptions about energy functions and so reflects the way that atoms in proteins interact. The information on the target protein is further complemented by a comprehensive database of bioinformatics data holding sequence and structural alignments from all closely and distantly related proteins.

The outputs include graphical maps of the binding sites, and the favorable atomic interactions within them, which can be viewed in the most commonly used molecular graphics software packages, such as Sybyl, Quanta, RasMol, etc. The maps can contain consensus information from the related protein structures, showing either commonalities or differences that may be significant for specificity.

Also output is a pharmacophore for searching small-molecule databases and a map of small molecule fragments that allows identification of common organic building blocks likely to bind most favorably in the active site; this knowledge can be of direct benefit in the design of combi-chem.

Using the above analytical procedure, our chemoinformatics program has established a knowledge-based database containing the pre-calculated consensus information on protein families such as HIV, kinases etc. The data include sequence/structure alignments, their structural folds and motifs, ligand information, active site residues, active site regions and volumes, interaction probe maps, pharmacophores and lists of most favorable binding fragments.


New methods for studying receptor-ligand interactions

Zsolt Zsoldos and Aniko Simon
SimBioSys Inc., 135 Queen's Plate Drive, Unit 355, Toronto, Ontario, M9W 6V1, Canada

The presentation focuses on various aspects of receptor-ligand interaction studies. First, a set of new mathematical models will be presented for estimating the energy contributions of various interaction types between biological macromolecules (protein, RNA or DNA receptors) and drug-like small-molecule ligands. The interaction types considered include hydrogen bonding, hydrophobicity, electrostatics, metal ions and covalent bonding patterns.

Our software system will then be described. It provides a very efficient way of calculating and evaluating continuous energy fields described either by our models or by any standard force field; that is, our algorithms for fast evaluation of 3D volumetric functions are not specific to our models but are generally applicable to any well-known force field, e.g. Amber, CHARMM, MM3 or OPLS.

Finally, our interaction field calculation system will be demonstrated on practical examples by comparing numerical results of the estimated interaction energy with experimental values and results of theoretical (quantum chemical) calculations. Furthermore, the calculated force fields will be shown by novel volumetric visualization techniques.

For more information, see the company's web site: http://www.simbiosys.ca/


Generating synthetically accessible ligands by de novo design

Krisztina Boda and A. Peter Johnson
School of Chemistry, University of Leeds, Leeds, LS2 9JT, UK

One of the deficiencies of de novo molecular structure design programs is that after a structure generation process which may be very demanding of computer resources, many of the solutions produced may not be synthetically accessible. The CAESA program attempts to overcome this deficiency by post generation scoring and ranking according to an estimate of synthetic accessibility, but this approach is inefficient in that large numbers of structures are generated and then pruned in a computationally demanding process.

The approach used in SYNSPROUT, a new variant of SPROUT, is to build synthetic constraints into the structure generation process by starting with a library of readily available starting materials. These are used both in the initial docking process and in a build-up process that only permits joins corresponding exactly to a chemical reaction defined in a user-created knowledge base.

The current version of the program works well with medium-sized databases of starting materials. For large databases such as the ACD, the combinatorial nature of the structure generation process means that even the recently developed parallel version would be too slow; work in hand is geared to overcoming this problem. The presentation will provide an overview of the problems encountered and some solutions, together with examples of the system in action.


Analysis of flexible ligands: from conformational libraries to leads

Jonathan M. Goodman and Setu P. Roday
University of Cambridge, Cambridge, UK

Flexible molecules may be useful ligands, but they are challenging to analyse computationally because of their conformational freedom. This problem can be approached by using genetic algorithms to build up conformation libraries of structures, and using these libraries to generate virtual combinatorial libraries of related compounds. This sharing of information provides a rapid method of analysing potential ligands.


Flexsim-R: A new set of 3D descriptors for combinatorial library design and in-silico screening

Hans Briem1, Hans Matter2, Andreas Teckentrup1 and Alexander Weber1
1Boehringer Ingelheim Pharma KG, Dept. of Lead Discovery, D-55216 Ingelheim, Germany
2Aventis Pharma, DI&A Chemistry, D-65926 Frankfurt, Germany

The idea of using in-vitro affinity fingerprints as similarity or QSAR descriptors was introduced in 1995 by Kauvar et al. [1] The underlying hypothesis is that compounds having a similar in-vitro binding profile in a set of reference assays may also exhibit similar behaviour in any new target assay.

By replacing the in-vitro assays with computational docking or superpositioning, we have recently transferred this approach into in-silico screening experiments, resulting in so-called virtual affinity fingerprints [2,3,4].

Here we report on a further extension of this methodology, which is particularly suited to assess similarities of smaller reagent-sized molecular fragments typically used for combinatorial libraries or parallel synthesis.

We will describe the development and validation of these novel 3D descriptors. In addition, possible applications for combinatorial library design and virtual screening will be discussed.

[1] L. M. Kauvar et al., Chem. Biol., 2, 107 (1995).

[2] H. Briem, I. D. Kuntz, J. Med. Chem., 39, 3401 (1996).

[3] U. F. Lessel, H. Briem, J. Chem. Inf. Comput. Sci., 40, 256 (2000).

[4] H. Briem, U. F. Lessel, Perspect. Drug Discovery Des., 20 (2000), in press.


Similarity searching in large combinatorial chemistry spaces

Matthias Rarey1 and Martin Stahl2
1GMD -- German National Research Center for Information Technology Institute for Algorithms and Scientific Computing (SCAI), 53754 Sankt Augustin, Germany
2F.Hoffmann - La Roche AG Pharmaceutical Division, Molecular Design and Bioinformatics, CH-4070 Basel, Switzerland

Similarity searching is a central concept for lead finding whenever a three-dimensional structure of the target protein is unavailable. The search space could either be a compound database, the closed form of a combinatorial library, or a more generally defined "chemistry space". By a chemistry space we mean a large set of diverse fragments together with generic definitions of how the fragments can be combined into molecules. Most similarity searching methods available today are based on descriptors such as structural keys or fingerprints, which convert the molecule into a linear string representation. These descriptors have proven efficient for similarity searching in databases of explicitly enumerated compounds. However, no efficient way is known to combine string descriptors for molecular fragments into descriptors for molecules of arbitrary topology built from these fragments.

Three years ago, the feature tree descriptor was developed [1]. This descriptor represents a molecule by an undirected labeled tree. Each tree node represents a small fragment or building block of the molecule, and the node label stores information about the shape and physicochemical properties of the fragment. Two molecules are compared by creating a matching between their feature trees, and a similarity value is calculated from this matching. Since the feature tree descriptor already represents a molecule by its fragments, it can be used for efficient searches in chemistry spaces.

Using the feature tree descriptor, we have developed a method to search combinatorial libraries and large chemistry spaces. The method is based on a dynamic programming algorithm that makes it possible to avoid enumerating or constructing any molecules but the actual hit molecules in the final stage. Under certain parameter settings (the similarity value must be additive over fragments and the physicochemical properties stored at a node must depend only on the fragment the node represents), we can show that the search is optimal, i.e. that the molecules most similar to the query molecule are created. The method can also be applied to search for molecules on an arbitrary similarity level with respect to the query molecule. In addition, it can be used to create a diverse subset of molecules similar to the query on a given similarity level.
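The benefit of additivity can be seen in a toy version of the problem: when the score of a product is the sum of per-position fragment scores, the best product in a combinatorial space is found by optimising each position independently, without enumerating products. This is only a cartoon of the feature-tree dynamic programming, with hypothetical fragment names and scores:

```python
# per-position fragment scores against a query (hypothetical values)
positions = [
    {"acid_17": 0.9, "acid_02": 0.4, "acid_55": 0.7},
    {"amine_03": 0.8, "amine_11": 0.85, "amine_40": 0.2},
    {"linker_a": 0.6, "linker_b": 0.95},
]

# additive score => optimise each position independently
best = [max(frags, key=frags.get) for frags in positions]
best_score = sum(frags[b] for frags, b in zip(positions, best))
print(best, round(best_score, 2))

# fragment evaluations performed vs. products in the space
evaluated = sum(len(frags) for frags in positions)
space_size = 1
for frags in positions:
    space_size *= len(frags)
print(evaluated, space_size)   # 8 evaluations cover an 18-product space
```

With thousands of fragments per position the gap between evaluations and space size becomes the orders-of-magnitude saving the method exploits; the real algorithm additionally handles arbitrary topologies and similarity levels below 1.0.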

The new search method is applied to a chemistry space created by shredding the WDI into small fragments [2,3]. The space contains about 17,000 fragments which can be connected to each other via 12 different link types. The space is theoretically infinite. Limited to reasonably sized molecules (consisting of fewer than 6 fragments), it still contains about 2.15 × 10^18 molecules. Depending on the size of the query molecule, this space can be searched in between 2 and 20 minutes on a single-CPU workstation.

In order to test the method, several drug-like molecules have been used as queries. Searching at similarity level 1.0, the query molecules and closely related analogs are consistently retrieved. Searching at a lower similarity level, structurally more diverse molecules still related to the query are found. As an application scenario, we show that the method is able to jump between known inhibitors of different structural classes on sets of about 50 dopamine D4 antagonists and 50 histamine H1 antagonists.

[1] Rarey, M. and Dixon, J.S.: Feature Trees: A new molecular similarity measure based on tree matching, J.Comput.-Aided Mol.Design, 12, pp 471--490 (1998)

[2] Lewell, X. Q.; Judd, D. B.; Watson, S. P. and Hann, M. M. RECAP - Retrosynthetic combinatorial analysis procedure: A powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J. Chem. Inf. Comput. Sci. 38, 511--522, (1998)

[3] Schneider, G.; Lee, M.L., Stahl, M. and Schneider, P.: De novo design of molecular architectures by evolutionary assembly of drug-derived building blocks, J.Comput.-Aided Mol.Design, 14, pp 487--494 (2000)


Know more before you score: Analysis and optimization of structure-based virtual screening protocols

Andrew C. Good, Daniel L. Cheney, William E. Harte, Yi Li, Stanley R. Krystek, Donna A. Bassolino, John S. Tokarski, Terry R. Stouch, Yaxiong Sun, Malcolm E Davis, Deborah Loughney, Jonathan S. Mason and Doree F. Sitkoff,
Bristol-Myers Squibb, 5 Research Parkway, Wallingford, CT 06492, USA

There has been much research into the development of new scoring functions for structure-based virtual screening. While some advances have been made in improving virtual screening results through this approach, in general progress has been limited. Here we highlight results obtained from DOCK studies designed to improve virtual screening via analysis of the screening phases that occur before scoring. Conformational analysis studies were undertaken on diverse PDB ligands using a variety of techniques.

Search methods were compared via their ability to reproduce conformers close to the bioactive structure at sampling levels typically employed in virtual screening. In addition, five different target proteins, each with an associated set of active compounds, were used to analyze the effect of docking variables such as ligand flexibility, site point definition and node sampling levels. The ranking of these active compounds when combined with a set of ~10000 "noise" compounds was used to compare screening enrichment levels and hence better determine optimum DOCK search paradigms. The results of these studies are discussed and their implications for the direction of future virtual screening research are highlighted.
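Enrichment in such comparisons is typically expressed as the factor by which actives are concentrated in the top-ranked fraction relative to random selection. A minimal calculation on synthetic scores (the distributions below are invented, not the paper's data):

```python
import random

def enrichment_factor(scores, is_active, fraction=0.01):
    """Actives recovered in the top-ranked fraction, relative to the
    number expected from random selection. Higher scores rank first."""
    n_top = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = sum(is_active[i] for i in order[:n_top])
    expected = sum(is_active) * n_top / len(scores)
    return hits / expected

# 100 'actives' scored somewhat higher on average than 9900 'noise' compounds
random.seed(0)
scores = [random.gauss(1.0, 1.0) for _ in range(100)] + \
         [random.gauss(0.0, 1.0) for _ in range(9900)]
labels = [1] * 100 + [0] * 9900
ef = enrichment_factor(scores, labels, fraction=0.01)
print(round(ef, 1))   # well above 1: the ranking concentrates actives
```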


DockCrunch and beyond: The future of receptor-based virtual screening

B. Waszkowycz, T. Perkins and J.Li
Protherics Molecular Design, Lyme Green Business Park, Macclesfield, SK11 0JL, UK

High-throughput biological assays are now a well-established and routine tool in the search for new lead compounds in the pharmaceutical industry. Yet there are situations in which they can be usefully complemented by virtual (or `in silico') screening strategies. This is particularly true when the compounds are not available in-house, for example, when assessing the purchase of external databases or selecting subsets of very large virtual combinatorial libraries.

A number of computational methods are available to perform virtual screening, ranging from 2-D molecular similarity scores through pharmacophore methods to molecular docking. In those cases where a receptor structure is available, docking has the advantage, at least in principle, of being able to identify any ligand capable of binding to the receptor in any binding mode. This implies that docking methods are likely to highlight more structural classes as potential ligands than other virtual screening methods.

Protherics' virtual screening technology is based around PRO_LEADS, the docking module of our Prometheus software: each molecule is docked flexibly into the complete receptor and its binding energy predicted. The scoring function (ChemScore) was calibrated using X-ray crystal complexes to predict the free energy of binding. Parameters for this type of empirical function can only be established using ligands that form crystallographic complexes, and features detrimental to binding may therefore be under-represented in the function. As a result, the binding energies of some random compounds may be over-predicted.

Our approach in virtual screening is to calculate, from the docked state, a number of steric, hydrogen-bonding, and electrostatic descriptors that encode ligand-receptor complementarity. These can then be used as filters to select subsets of interesting potential `hits'.

The talk will describe DockCrunch, Protherics' collaboration with SGI to dock more than one million molecules into two forms of the estrogen receptor, and the filtering tools that allowed rapid analysis and subset selection. As a proof-of-concept exercise, 37 compounds were purchased and assayed by an independent company. Of these, 21 had micromolar or better activity, with a number having Ki values in the range 10-100 nM.

None of the compounds is structurally similar to literature estrogen ligands. The application of this technology to virtual screening of combinatorial libraries will also be presented, together with a description of the current developments to the docking and analysis methodologies.


Molecular similarity and chemical families. The homogeneity approach.

Christos A. Nicolaou, Brian P. Kelley, David W. Miller and Terrence K. Brunck
Bioreason, Inc., 150 Washington Ave., Suite 303, Santa Fe, NM 87501, USA

Molecular similarity and diversity methods are widely used in the modern drug discovery process. Active research addresses their use in selecting sufficiently diverse and representative subsets of large datasets, such as a corporate compound library, for screening [1]. In addition, a variety of computational methods, such as clustering and classification, use molecular similarity to organize compounds into chemically sensible clusters [2, 3] or to assign compounds to appropriate classes [4]. Most compound similarity methods rely heavily on compound representation methods [3, 5] used in conjunction with similarity and distance (dissimilarity) measures [5]. In this research we focus on the problem of identifying groups of compounds that not only have similar molecular representations but also constitute chemical families as defined and expected by chemists. To this end we attempt to define chemical families and introduce the concept of homogeneity. We then present a range of new approaches for assessing compound set homogeneity and chemical family formation. We also discuss the most common methods for assessing molecular similarity and diversity, emphasizing the special case in which fragment-based descriptors are used to represent molecules and the similarity of the molecules is deduced from the distance between their respective vectors. In the empirical section we use old and new methods to assess the degree to which selected groups of compounds resulting from a clustering or classification analysis form a chemical family.

  1. Agrafiotis, D.K. Stochastic algorithms for maximizing molecular diversity, J. Chem. Inf. Comput. Sci. 1997, 37: 841-851.
  2. Brown, R.D.; Martin, Y.C. Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection, J. Chem. Inf. Comput. Sci. 1996, 36: 572-584.
  3. Wild, D.J.; Blankley, C.J. Comparison of 2D fingerprint types and hierarchy level selection methods for structural grouping using Ward's clustering, J. Chem. Inf. Comput. Sci. 2000, 40: 155-162.
  4. Miller, D.W. Results of a new classification algorithm combining K nearest neighbors and recursive partitioning, J. Chem. Inf. Comput. Sci. 2001, 41: 168-175.
  5. MacCuish, J.D.; Nicolaou, C.A.; MacCuish, N.J. Ties in proximity and clustering compounds, J. Chem. Inf. Comput. Sci. 2001, 41: 134-146.

Validating new techniques for HTS data analysis

Alain Calvet, Kjell Johnson and George Cowan
Pfizer Global Research and Development, Ann Arbor Laboratories, Michigan, USA

The current challenges in analyzing information contained in data sets from high (or ultra-high) throughput screening are the size of the data set and noise. We have addressed these challenges through the use of different computerized methods. Recursive Partitioning, Self-Organizing Maps (Kohonen maps), Linear Vector Quantization, a proprietary artificial intelligence method (Version Space SAR), Partial Least Squares and Support Vector Machines have been tested. This presentation describes results obtained using the last three of these techniques.

Filtration of the data allows us to eliminate compounds which may have been mislabelled by the first-pass screening. Models of activity are then built using this improved data set for learning.

Validation is done using secondary screening results (i.e. IC50 determinations) when they are available. A few hundred active compounds are easily retrieved from a corporate library of more than 300,000 compounds.

Results show that these techniques are suited to the analysis of large collections of compounds. We show that combining results from different analyses, or from different techniques, further enhances the performance of the classification.


Modeling HTS data for virtual screening

Robert Brown, Marvin Waldman, Tom Stockfisch and Moises Hassan
Molecular Simulations Inc, a subsidiary of Pharmacopeia, Inc., 9685 Scranton Road, San Diego, CA 92121, USA

With the ever-growing size of compound libraries and associated assays from High Throughput Screening (HTS), the need has emerged for statistical analysis tools that can operate within the constraints of the size and semi-quantitative nature of the data. We have compared the ability of such tools to build predictive models from HTS data. These models are designed to be used for the in-silico screening of virtual libraries, allowing only a focused subset to be physically made and tested.


Fast descriptor calculation for combinatorial libraries

Geoff Downs and John Barnard
BCI Ltd, 46 Uppergate Road, Stannington, Sheffield, S6 6BX, UK

This presentation will describe methods, utilising Markush structure representations of combinatorial libraries, for fast enumeration of the specific structures covered, and for generation of a range of structural descriptors. The algorithms work by analysing the Markush structure to generate descriptors directly, and offer order-of-magnitude speed improvements over approaches based on first enumerating and then analysing each molecule individually. The internal Markush representation can be built directly from commonly-used Markush descriptions of libraries (e.g. RGfiles, cSLN) or incrementally from generic reaction and precursor molecule descriptions.

Previously-described work on generation of fragment-based structure fingerprints has been extended to calculation of property values such as the "Lipinski" descriptors and topological indices. logP values are calculated using an adaptation of a recently-published atom-additive method (Wildman and Crippen, JCICS 39, 868-873, 1999), in which the atom types and associated contributions are specified in a fragment dictionary file, analogous to that used for structure fingerprint generation. A range of topological indices can be calculated, including the Kier subgraph index counts and Chi Connectivity Indices. Though no limit is imposed on the maximum order of these indices, an analysis of generation times for higher-order indices illustrates some of the problems inherent in their calculation, and some limitations of the Markush approach. The use of these descriptors for clustering library members will also be mentioned.
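The atom-additive scheme described above can be sketched in a few lines. The atom types and contribution values below are illustrative placeholders, not the published Wildman and Crippen parameters or BCI's fragment dictionary; the point is only that logP is obtained by typing each atom and summing per-type contributions:

```python
# Hypothetical fragment dictionary: atom type -> logP contribution.
# Real implementations assign types by substructure pattern matching.
CONTRIB = {
    "C_aliphatic": 0.14,
    "C_aromatic": 0.29,
    "O_hydroxyl": -0.64,
    "N_amine": -0.95,
}

def additive_logp(atom_types):
    """Sum per-atom contributions for a molecule given as a list of atom types."""
    return sum(CONTRIB[t] for t in atom_types)

# Toy "phenol-like" atom list: six aromatic carbons and one hydroxyl oxygen.
print(round(additive_logp(["C_aromatic"] * 6 + ["O_hydroxyl"]), 2))  # 1.1
```

Because the value is a simple sum over atoms, contributions for the fixed and variable parts of a Markush structure can be accumulated separately, which is what makes enumeration-free calculation fast.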

Work on rapid comparison of libraries, and identification of overlapping structures, will be described; this is based on the use of structure-matching algorithms originally developed for chemical patent search systems. A possible approach to rapid generation of 3D descriptors for library members, using Markush structures, will be outlined.


A new association coefficient for molecular similarity

Michael Fligner, Joseph Verducci and Paul Blower
Department of Statistics, The Ohio State University and LeadScope, Inc., Columbus OH, USA

The growth of robotic techniques such as combinatorial chemistry and high-throughput screening has dramatically increased the speed at which compounds are made and tested for biological activity, and the quantities involved. To cope with the enormous amounts of data being produced, many groups use diversity analysis to select or design new compounds that are structurally dissimilar to compounds already tested. Where the molecular descriptors take on discrete values, the Tanimoto association coefficient is the most commonly used measure of similarity or chemical distance between two compounds. However, when used to select dissimilar compounds, the Tanimoto coefficient has an intrinsic bias that tends to overselect smaller compounds. We have developed a new association coefficient that overcomes this bias. This paper will give details of the new coefficient and contrast the two coefficients for selecting diverse sets of compounds from a large collection.
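The small-compound bias can be seen directly from the definition of the coefficient. A minimal sketch, treating a fingerprint as the set of its on bits (the fingerprints below are hypothetical):

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of descriptor bits."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# A small molecule sets few bits; a large one sets many.
small = {1, 2, 3}
large = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
other = {11, 12, 13}

# With no shared bits the small molecule reaches the maximum distance of
# 1.0, so dissimilarity-based selection tends to overselect small compounds.
print(tanimoto(small, other))  # 0.0 -> distance 1.0
print(tanimoto(small, large))  # 0.3
```

Even when all of the small molecule's bits are contained in the large one, the coefficient stays low (0.3 here), while two small molecules with disjoint bits hit maximum dissimilarity outright.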


MDL Keys Revisited

Joseph L. Durant, Burton A. Leland, Douglas R. Henry and James G. Nourse
MDL Information Systems, Inc. 14600 Catalina Street San Leandro, CA 94577, USA

Structurally and topologically based keys have a long history of use in cheminformatics. Present applications include fingerprinting, clustering, diversity analysis, decision tree construction and library evaluation. The successful use of MDL keys in these applications is interesting considering the fact that these keys were designed and optimized for substructure searching of databases.

MDL has previously made available both a 166-bit keyset and a 960-bit keyset. The keys encode information on a variety of atomic properties located from 0 to 4 bonds apart, together with selected atomic environments and functional groups. The 960-bit keyset was developed and optimized for structure searching. The 166-bit keyset is a subset of the 960-bit keyset, comprising bits that map onto chemically interesting features.
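The basic mechanics of a keyed fingerprint of this kind can be sketched as follows. The key names below are invented placeholders, not actual MDL keys; each predefined structural feature owns a fixed bit position, and a molecule's fingerprint sets the bit for every feature it contains:

```python
# Hypothetical keyset: each feature maps to one fixed bit position.
KEYS = ["has_ring", "has_N", "has_O", "halogen", "carbonyl"]
KEY_BIT = {key: i for i, key in enumerate(KEYS)}

def fingerprint(features):
    """Build a keyed fingerprint from the set of features a molecule contains."""
    bits = 0
    for f in features:
        bits |= 1 << KEY_BIT[f]
    return bits

fp = fingerprint({"has_ring", "has_O", "carbonyl"})
print(format(fp, "05b"))  # 10101 -- one bit per key, set where present
```

Multiple-occurrence reporting, as mentioned above, would replace each single presence/absence bit with several bits (or a count) per feature, expanding the keyspace accordingly.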

We have undertaken a re-examination of the entire MDL keyspace (approximately 3500 unique features, expandable by multiple occurrence reporting) with an eye towards customizing keysets for use in non-search applications. Results will be presented using these expanded keysets, comparing and contrasting their performance with that observed using the 166- and 960-bit keysets.


Empirical validation of the effectiveness of chemical descriptors for similarity analysis and datamining

Kirk Simmons
DuPont Ag Products, Stine-Haskell Research Center, 1090 Elkton Road, Newark, DE 19714, USA

There is an ongoing discussion in the computational community regarding chemical descriptors and their use in retrieving compounds for testing from corporate databases. This paper will present the results of a multi-year study in which several types of chemical descriptors were evaluated using actual high-throughput screening results derived from several unrelated biological targets. The focus of this study was to quantify the effectiveness of the more commonly used chemical descriptors at ranking databases in similarity analyses, as well as their effectiveness in algorithms targeted toward datamining.