Poster 49: GTM-based Classification Models and Their Applicability Domain: Application to the Biopharmaceutics Drug Disposition Classification System
Héléna A. Gaspar1, Gilles Marcou1, Philippe Vayer2, Alexandre Varnek1
1Laboratoire de Chémoinformatique, UMR 7140 CNRS, Université de Strasbourg, 1 rue B. Pascal, 67000 Strasbourg, France
2Technologie Servier, Orléans, France
Generative Topographic Mapping (GTM) is the probabilistic counterpart of Kohonen maps. Each i-th object in the N-dimensional initial space is projected onto the k-th node of a 2D latent space with a probability (“responsibility”) R_ik. The object’s position on the map is thus defined by the center of gravity of the distribution of its responsibilities. The ensemble of responsibilities for the whole dataset forms a probability density function, which can be used for structure-property modeling in the framework of Bayes’ theorem [1].
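The center-of-gravity positioning described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `mean_positions`, the toy 2 x 2 node grid, and the responsibility values are all assumptions made for the example, and the responsibilities are taken as already computed by a fitted GTM.

```python
import numpy as np

def mean_positions(responsibilities, node_coords):
    """Return the responsibility-weighted mean 2D position of each object."""
    R = np.asarray(responsibilities, dtype=float)
    # Normalize each row so the responsibilities of one object sum to 1.
    R = R / R.sum(axis=1, keepdims=True)
    # Weighted average of node coordinates: shape (n_objects, 2).
    return R @ np.asarray(node_coords, dtype=float)

# Hypothetical 2 x 2 grid of nodes in the latent square [-1, 1] x [-1, 1].
node_coords = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])

# Two toy objects: one concentrated on node 0, one spread evenly over all nodes.
R = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.25, 0.25, 0.25, 0.25]])

positions = mean_positions(R, node_coords)
print(positions)
```

An object whose responsibility sits entirely on one node lands on that node, while an object with a flat responsibility distribution lands at the center of the map.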
This approach has been used to build a four-class classification model for the Biopharmaceutics Drug Disposition Classification System (BDDCS), in which each class is defined by the aqueous solubility and metabolic stability of the molecules. The modeling was performed on the dataset of 893 molecules from reference [2] using VolSurf descriptors.
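One plausible reading of a responsibility-based Bayesian classifier is sketched below; the abstract does not give the exact formulas, so this is an assumption. Responsibilities of training objects are cumulated per class and per node, normalized into node-wise class probabilities, and a new object is scored by weighting those probabilities with its own responsibilities. The function names, the two-class toy data, and the uniform fallback for empty nodes are all hypothetical.

```python
import numpy as np

def node_class_probabilities(R_train, labels, n_classes):
    """Cumulated responsibilities per node and class, plus node-wise class probabilities."""
    R_train = np.asarray(R_train, dtype=float)
    labels = np.asarray(labels)
    # cumulated[k, p] = sum of R_ik over training objects i belonging to class p.
    cumulated = np.stack(
        [R_train[labels == p].sum(axis=0) for p in range(n_classes)], axis=1
    )
    totals = cumulated.sum(axis=1, keepdims=True)
    # Assumption: nodes with no responsibility at all get a uniform distribution.
    probs = np.divide(cumulated, totals,
                      out=np.full_like(cumulated, 1.0 / n_classes),
                      where=totals > 0)
    return cumulated, probs

def predict_class(R_new, probs):
    """Score each new object as sum_k R_xk * P(class | node k) and pick the best class."""
    scores = np.asarray(R_new, dtype=float) @ probs
    return scores.argmax(axis=1), scores

# Toy data: 4 training objects, 3 nodes, 2 classes (all values are illustrative).
R_train = np.array([[0.9, 0.1, 0.0],
                    [0.8, 0.2, 0.0],
                    [0.0, 0.3, 0.7],
                    [0.1, 0.1, 0.8]])
labels = np.array([0, 0, 1, 1])
cumulated, probs = node_class_probabilities(R_train, labels, n_classes=2)

R_new = np.array([[0.7, 0.3, 0.0],
                  [0.0, 0.2, 0.8]])
predicted, scores = predict_class(R_new, probs)
print(predicted)
```

This sketch ignores class imbalance; with unequal class sizes, the cumulated responsibilities would likely need to be reweighted by class priors.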
GTM provides an interesting way to visualize the distribution of classes in the latent space. A class C_p is assigned to a given node if its cumulated responsibility P_k(C_p) in this node exceeds that of any other class C_q by a given factor: P_k(C_p) / P_k(C_q) > CPF, where CPF (class prevalence factor) is a user-defined parameter. Increasing CPF enlarges the “empty space” (white zones) on the map, in which no class is predominant. It has been shown that for CPF >= 5, most of the incorrectly classified molecules are mapped into this empty space. Conversely, objects projected outside it, into the zones where one class is predominant, have a high probability of being correctly predicted (Figure 1).
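The class prevalence rule can be illustrated with a short sketch. It assumes the cumulated responsibilities are available as a nodes-by-classes array; the function name `assign_node_classes`, the use of -1 to mark "empty" nodes, and the toy values are choices made for this example only. The ratio test is rewritten as a product to avoid dividing by zero when the runner-up class has no responsibility in a node.

```python
import numpy as np

def assign_node_classes(cumulated, cpf=1.0):
    """Return the predominant class per node, or -1 where no class is predominant."""
    P = np.asarray(cumulated, dtype=float)
    order = np.argsort(P, axis=1)          # classes sorted by cumulated responsibility
    rows = np.arange(P.shape[0])
    best = order[:, -1]                    # index of the top class in each node
    top = P[rows, best]
    runner_up = P[rows, order[:, -2]]      # second-largest value in each node
    # Equivalent to top / runner_up > CPF, written without a division.
    predominant = top > cpf * runner_up
    return np.where(predominant, best, -1)

# Toy map with 3 nodes and 4 classes (values are illustrative only).
cumulated = np.array([[6.0, 1.0, 0.5, 0.2],   # class 0 clearly dominates
                      [2.0, 1.5, 0.1, 0.0],   # class 0 leads only narrowly
                      [0.0, 0.0, 0.0, 0.0]])  # node with no responsibility

classes_cpf1 = assign_node_classes(cumulated, cpf=1.0)
classes_cpf5 = assign_node_classes(cumulated, cpf=5.0)
print(classes_cpf1, classes_cpf5)
```

Raising `cpf` from 1 to 5 turns the narrowly decided node into an empty one, which mirrors the growth of the white zones on the map. Checking the top class against the runner-up is sufficient because any class that beats the second-largest value by the factor also beats all smaller ones.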
Thus, GTM can serve at once as a learning technique, as a map of the model properties, and as a representation of the validity domain of the model, in which the nodes delineate the applicability domain of the GTM-based classification model.
[1] N. Kireeva; I. I. Baskin; H. A. Gaspar; D. Horvath; G. Marcou; A. Varnek Mol. Inf. 31 (2012) 301-312
[2] L. Z. Benet; F. Broccatelli; T. I. Oprea AAPS J. 13(4) (2011) 519-547
Figure 1. Probability density distribution for four BDDCS classes for CPF = 1 (left) and 5 (right). On the maps, incorrectly predicted molecules are shown as black dots, “uncertainty” regions are empty (white) and regions where one class is predominant are shown by colored squares.