Hierarchical Fuzzy Clustering with a Galois LatticeNicola J Richmond1
|1GlaxoSmithKline, Medicines Research Centre, Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2NY|
|A Galois lattice is a formal ontology derived from a collection of objects and their attributes. A node of the lattice corresponds to a subset of the objects sharing the same subset of the attributes and two nodes are connected if the object set of one node is a subset of the object set of the other node. This approach to structuring data has been widely used in a number of information related fields such as data mining, text mining, machine learning and knowledge management and more recently in the area of market basket analysis, where retailers seek to understand the purchasing behaviour of their customers to aid with store design and sales promotions amongst others. As an example, given a data set of consumers (object) and their transactions (attributes), a retailer can learn, via the lattice, that shampoo and conditioner are frequently bought together, so placing both products on promotion at the same time would not necessarily have a positive impact on profit.|
The use of Galois lattices also extends to bioinformatics but the use of these lattices in the field of chemoinformatics has been limited to the identification of privileged fragments. Here we describe an algorithm where the lattice itself, as opposed to the association rules it generates, is exploited to provide a hierarchical, fuzzy clustering that has a number of applications in chemoinformatics.
In our scenario, the objects of our data set correspond to molecules and the attributes correspond to the set of all possible fragments of these molecules. A node in the Galois lattice of this data then corresponds to a set of molecules sharing a set of common fragments. The bottom of the lattice contains single molecules and all their possible fragments and each node along an upward path contains a bigger superset of molecules, clustered together on smaller subsets of common fragments.
In general, Galois lattices can become large enough that the problem of mining the data becomes one of mining the lattice. To reduce the size and complexity of the lattice, we can stipulate a lower bound on the generated fragment atom count, thus reducing the number of lattice nodes and, at the same time, providing more chemically meaningful clusters. A further reduction in nodes can also be achieved by applying a minimum bound of the number of shared attributes.
There are two applications we focus on, one is the generation of a hierarchical clustering that can be visually inspected and interrogated within Cytoscape, the other is the automated R-group analysis of a SAR data set. We compare the application of Galois lattices to these problems with existing methodologies.
Typically, the computational chemist will manually fragment a SAR data set using a series of substructures searches. This process can be time consuming due to the number of repeats often required to achieve the desired fragmentation. This lattice based approach provides an automated fragmentation and hierarchy, and by incorporating experimental activity data, a lattice node can be assigned an average activity. Thus when walking up the lattice, the addition of a molecule and the subsequent loss of fragments can modify the average activity of the lattice node. These changes are easily identified and can provide valuable insight to the medicinal chemist.