Poster 47: Ligand Based Virtual Screening using Ensemble Classification for the Identification of Anti-Tubercular Molecules
Nisha Chandran1, Anurag Passi1, Abhilash Gangadharan2, OSDD Consortium1
2CSIR-Institute of Genomics and Integrative Biology
Tuberculosis persists as one of the deadliest diseases, claiming nearly 1.3 million lives per year. The abundance of data now being generated can be exploited by data-mining approaches designed specifically for computational analysis of biological data. Cheminformatics and pharmaceutical research have started integrating these approaches into their drug discovery pipelines, where they complement structure-based virtual screening.
The work described here uses R, the open-source statistical environment. The CARET (Classification and Regression Training) package was used to build the predictive models. The data sets were obtained from the confirmatory bioassay screens against the H37Rv strain of Mycobacterium tuberculosis performed by the Southern Research Institute and available from PubChem. CARET supports more than 85 classification and regression models. From these, we divided our model-building exercise into the following five broad families of classification:
1) Kernel Based Approaches
2) Decision Trees
3) Neural Networks
4) Bayesian Classification
5) Instance Based Learning
To build the training set, molecular descriptors were calculated using the CDK (Chemistry Development Kit), a freely available Java library. Ten-fold cross-validation was performed to estimate the performance of each model. The data were pre-processed and feature selection was performed to ensure the robustness of the training set. Using this training set, predictive models were built whose accuracy could be measured on a test set. A small-molecule set of thirty thousand molecules was used as the test set. This set was then screened using the models built from the training set.
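The ten-fold cross-validation step can be sketched as follows. This is a minimal illustration in plain Python, not the R/CARET code used in this work; the function names `k_fold_indices` and `cross_validated_accuracy`, and the `fit`/`predict` callables, are hypothetical and stand in for whichever classifier is being evaluated.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    if n_samples < k:
        raise ValueError("need at least k samples to form k non-empty folds")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [
            j
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for j in fold
        ]
        yield train_idx, test_idx


def cross_validated_accuracy(features, labels, fit, predict, k=10, seed=0):
    """Mean held-out accuracy over k folds.

    fit(train_features, train_labels) returns a model;
    predict(model, feature_row) returns a predicted label.
    Both are placeholders (assumptions) for a real classifier.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k, seed):
        model = fit(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
        )
        hits = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)
```

Each molecule is held out exactly once, so the averaged score estimates how a model would perform on compounds it has not seen during training.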
Once the overall accuracy values had been obtained for each classifier, a weighting algorithm was applied to improve the accuracy of the ensemble of classifiers for each predictive model built. The top-ranking molecules were picked for further screening.
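One plausible form of such a weighting is an accuracy-weighted vote. The abstract does not state the actual formula, so the sketch below is an assumption: each classifier's call is weighted by its overall accuracy, and molecules are ranked by the weighted fraction of "active" votes. It is written in plain Python for illustration rather than in R, and the names `weighted_vote` and `rank_molecules`, along with the classifier labels in the usage note, are hypothetical.

```python
def weighted_vote(predictions, accuracies):
    """Combine binary calls (1 = active, 0 = inactive) for one molecule.

    predictions: dict mapping classifier name -> 0/1 call
    accuracies:  dict mapping classifier name -> accuracy in [0, 1]
    Returns the accuracy-weighted fraction of votes for 'active'.
    Using raw accuracy as the weight is an assumption of this sketch.
    """
    total = sum(accuracies[name] for name in predictions)
    if total == 0:
        raise ValueError("at least one classifier needs a non-zero weight")
    active = sum(
        accuracies[name] for name, call in predictions.items() if call == 1
    )
    return active / total


def rank_molecules(calls_by_molecule, accuracies, top_n):
    """Return the top_n molecule ids ordered by weighted 'active' score."""
    scores = {
        molecule: weighted_vote(calls, accuracies)
        for molecule, calls in calls_by_molecule.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

For example, with accuracies of 0.9, 0.6 and 0.5 for three classifiers labelled `svm`, `tree` and `nb`, a molecule called active only by `svm` scores 0.9 / 2.0 = 0.45, so a single accurate classifier counts for more than a single weak one but cannot outvote the rest on its own.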