Building Machine Learning Models Using Relevant Features
Akos Tarcsay, Janos Marx, Laszlo Antal, Gabor Kovacs and Endre Andras
Chemaxon Kft., Váci út 133., 1138 Budapest, Hungary
Selecting all relevant descriptors in the context of the labeled data is a fundamental step in building accurate machine learning models. Descriptors are most often generated initially based on the available capabilities. Not all of these variables are relevant for the given task (classification or regression), and their importance is not known in advance. Reducing the number of variables has several advantages. Fewer features mean faster model training. The accuracy of machine learning algorithms may decrease when the number of features is significantly above the optimum: models become more prone to overfitting, more specific to the training set, and less general. The identification of relevant descriptors can also be considered an additional result of the training, since relevant features uncover the underlying mechanism of the built model and support its interpretation and explanation.
We present here the results obtained by implementing the Boruta algorithm in Chemaxon Trainer Engine to select all relevant features. The Boruta algorithm [1, 2] is based on the feature importance of the variables: variables are selected based on the statistical significance of their importance values. The algorithm extends the calculated descriptors with randomized copies. For each descriptor, an additional “shadow” descriptor is added by randomly shuffling its values, which preserves the original distribution. Models are trained on the extended descriptor set containing the original descriptors and their shadows. During the iterative process, feature importance values are extracted using ensemble tree methods, and original descriptors are successively dropped if their importance is not significantly higher than the importance of the shadow descriptor pool. The process is repeated until a given cycle count is reached or the number of descriptors stabilizes.
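The loop described above can be illustrated with a minimal sketch. It is not the Trainer Engine implementation: it assumes scikit-learn's RandomForestClassifier as the ensemble tree method, uses impurity-based importances, and decides with a plain binomial test without multiple-testing correction. The names boruta_sketch and binom_tail_ge, the parameter defaults, and the synthetic data are hypothetical and chosen only for illustration.

```python
import math

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def binom_tail_ge(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def boruta_sketch(X, y, max_iter=30, alpha=0.01, n_trees=100, seed=0):
    """Simplified Boruta-style selection (illustrative assumptions only).

    Returns three lists of column indices: confirmed, rejected, tentative.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    status = np.zeros(n_features, dtype=int)  # 0 tentative, 1 confirmed, -1 rejected

    for it in range(1, max_iter + 1):
        if not np.any(status == 0):
            break  # every descriptor is decided, so the set has stabilized
        active = np.flatnonzero(status >= 0)  # rejected descriptors are dropped
        X_act = X[:, active]
        # Shadow descriptors: each column is shuffled independently, which keeps
        # its distribution but destroys any relationship with the labels.
        shadows = np.column_stack(
            [rng.permutation(X_act[:, j]) for j in range(X_act.shape[1])]
        )
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed + it)
        forest.fit(np.hstack([X_act, shadows]), y)
        imp = forest.feature_importances_
        real_imp, shadow_imp = imp[: len(active)], imp[len(active):]
        # A "hit" means the descriptor beat the best shadow in this iteration.
        hits[active] += (real_imp > shadow_imp.max()).astype(int)

        for j in np.flatnonzero(status == 0):
            k = int(hits[j])
            if binom_tail_ge(k, it) < alpha:
                status[j] = 1  # significantly more hits than chance
            elif 1.0 - binom_tail_ge(k + 1, it) < alpha:
                status[j] = -1  # significantly fewer hits than chance

    return (
        np.flatnonzero(status == 1).tolist(),
        np.flatnonzero(status == -1).tolist(),
        np.flatnonzero(status == 0).tolist(),
    )


# Synthetic data: only columns 0 and 1 determine the label.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(300, 10))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
confirmed, rejected, tentative = boruta_sketch(X_demo, y_demo)
print("confirmed:", confirmed, "rejected:", rejected, "tentative:", tentative)
```

In this sketch a descriptor that is still undecided when the cycle count is reached stays "tentative". How such descriptors are handled in practice is a design choice that the sketch leaves open.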
In our study, we applied the Boruta algorithm to a large number of targets to compare the accuracy of models built with the reduced set of selected relevant features to the accuracy obtained with the full descriptor set. We analyzed the associations between biological readouts and descriptors. Since feature importance is influenced by the hyperparameters, we also investigated the effect of hyperparameters on feature selection. We focused on hyperparameters related to regularization, especially the mTry parameter of the ensemble tree.
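One way to probe the dependence on mTry is sketched below. It assumes that mTry corresponds to the number of candidate features considered at each split, which scikit-learn exposes as max_features. The function hit_rate_by_mtry, the chosen mTry values, and the synthetic data (including a redundant descriptor correlated with column 0) are hypothetical and serve only as an illustration, not as the protocol used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def hit_rate_by_mtry(X, y, mtry_values, n_rounds=5, n_trees=100, seed=0):
    """For each mtry value, return the fraction of rounds in which each
    descriptor's importance exceeded the best shadow importance."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    rates = {}
    for mtry in mtry_values:
        hits = np.zeros(n_features)
        for r in range(n_rounds):
            # Fresh shadow descriptors for every round (column-wise shuffles).
            shadows = np.column_stack(
                [rng.permutation(X[:, j]) for j in range(n_features)]
            )
            forest = RandomForestClassifier(
                n_estimators=n_trees,
                max_features=mtry,  # assumed equivalent of mTry
                random_state=seed + r,
            )
            forest.fit(np.hstack([X, shadows]), y)
            imp = forest.feature_importances_
            hits += imp[:n_features] > imp[n_features:].max()
        rates[mtry] = hits / n_rounds
    return rates


# Synthetic data: columns 0 and 1 determine the label, column 2 is a noisy
# copy of column 0 (redundant), and the remaining columns are pure noise.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(300, 8))
X_demo[:, 2] = X_demo[:, 0] + 0.5 * demo_rng.normal(size=300)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# 16 equals the total column count (8 descriptors + 8 shadows), i.e. no
# random feature subsampling at the splits.
rates = hit_rate_by_mtry(X_demo, y_demo, mtry_values=(1, 4, 16))
for mtry, rate in rates.items():
    print("mtry =", mtry, "hit rates:", np.round(rate, 2))
```

Comparing the hit rates of the redundant column across mTry values gives a quick impression of how strongly the split-candidate setting can shift which descriptors stand out against the shadows.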
The presentation will discuss results on a large set of ChEMBL targets and on individual ADMET-related targets.