Using Open Source Tools and Public Data to Build Machine Learning Models in Support of Neglected Diseases Drug Discovery
Paul J Kowalczyk¹
¹SCYNEXIS, P O Box 12878, Research Triangle Park, NC, USA 27709-2878
We have built multiple predictive machine learning models (e.g., random forests, k-nearest neighbors, support vector machines, self-organizing maps, naïve Bayes) in support of neglected tropical diseases drug discovery. Screening data retrieved from ChEMBL-NTD are used to construct and validate the models. Programs written using the R software environment, the Python programming language, pandas and scikit-learn (the last two being Python libraries for data analysis and machine learning, respectively) are used for data retrieval, curation, visualization, analysis and mining. Compound identifiers, structural information (SMILES strings or SD files) and associated activities constitute the input. Descriptors are calculated using rcdk (the R interface to the CDK libraries) and RDKit. We demonstrate how one might report these cheminformatics experiments as instances of reproducible research, i.e., how one might author and distribute integrated dynamic documents that contain the text, code, data and any auxiliary content needed to recreate the computational results. We show how the contents of these documents, including figures and tables, can be recalculated each time the document is generated. Each model has been collected into a compendium, a “container” for all the elements that make up a model and its associated description: the primary data, the annotated computational code, figures, tables and derived data, together with textual documentation and conclusions. These compendia are interactive, being purposefully designed to allow researchers to reproduce, modify and extend their components. Additionally, we show that, by exploiting their interactive nature, compendia may be constructed to serve as tutorials for the various machine learning methods. We will describe compendia for the retrieval, curation, visualization, analysis and mining of neglected tropical diseases data deposited in ChEMBL-NTD. All compendia will be freely available.
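As an illustration of the kind of model named above, the following is a minimal sketch of k-nearest-neighbor classification with leave-one-out validation. It is not the code from the compendia. It assumes each compound has already been reduced to a set of "on" fingerprint bits (in practice these would come from RDKit or rcdk, and the model from scikit-learn); the toy data, the function names and the choice of Tanimoto similarity are assumptions made here so that the example runs with the Python standard library alone.

```python
from collections import Counter

# Toy data, invented for illustration only (not ChEMBL-NTD records):
# each compound is a set of "on" fingerprint bit positions plus an
# active (1) / inactive (0) label.
TOY_COMPOUNDS = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 3, 5}, 1),
    ({2, 3, 4, 5}, 1),
    ({10, 11, 12, 13}, 0),
    ({10, 11, 12, 14}, 0),
    ({11, 12, 13, 14}, 0),
]


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_predict(query, training, k=3):
    """Majority vote among the k training compounds most similar to query."""
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(compounds, k=3):
    """Hold out each compound in turn and predict its label from the rest."""
    hits = 0
    for i, (bits, label) in enumerate(compounds):
        rest = compounds[:i] + compounds[i + 1:]
        hits += knn_predict(bits, rest, k) == label
    return hits / len(compounds)


if __name__ == "__main__":
    print("predicted label:", knn_predict({1, 2, 4, 5}, TOY_COMPOUNDS))
    print("leave-one-out accuracy:", leave_one_out_accuracy(TOY_COMPOUNDS))
```

Because the data, the code and the printed results live in one script, rerunning it regenerates every number it reports, which is the property the compendia described above are designed to provide on a larger scale.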
8. Gentleman, Robert & Duncan Temple Lang, “Statistical Analyses and Reproducible Research” (May 2004). Bioconductor Project Working Papers, Working Paper 2. http://biostats.bepress.com/bioconductor/paper2