Estimating Classification Uncertainty for Ensemble Models
Robert D. Clark¹, Wenkel Liang¹, Marvin Waldman¹, Robert Fraczkiewicz
¹Simulations Plus, Inc.
Recent interest in quantitative structure-activity relationships (QSARs) to assist regulatory decision-making has increased awareness of the need to quantify predictive uncertainty. Research to date has focused on analyzing changes in overall predictivity or on the reliability of quantitative predictions for individual structures [e.g., 2,3]; the reliability of predictive classifications for individual structures has not been as thoroughly explored. We describe a way to estimate the uncertainty in predicted classifications for particular structures from the degree of concordance among the individual QSAR models in an ensemble. The technique was initially developed for classification ensembles in which voting is used to combine predictions, but is also applicable to systems where averaging is used.
Artificial neural network ensembles (ANNEs) are composed of multiple networks, each of which has the same number of neurons and structure-based descriptor inputs. They are trained independently and so differ in network weights, but they also differ in how observations in the shared training pool are allocated between the set used to train the network (the “training set”) and the verification set used to prevent overtraining. Performance is ultimately assessed against a completely external test set chosen at the beginning of the analysis and set aside during training of the individual networks.
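The data allocation described above can be sketched as follows. This is only an illustration: the function name `make_ensemble_splits`, the random allocation, and the default fractions are assumptions, not details taken from the work described here.

```python
import random
from typing import List, Tuple


def make_ensemble_splits(
    n_obs: int,
    k: int,
    test_fraction: float = 0.2,    # assumed value, not from the source
    verify_fraction: float = 0.25,  # assumed value, not from the source
    seed: int = 0,
) -> Tuple[List[int], List[Tuple[List[int], List[int]]]]:
    """Return (test_indices, [(train_indices, verify_indices), ...])."""
    rng = random.Random(seed)
    indices = list(range(n_obs))
    rng.shuffle(indices)

    # The external test set is chosen once, up front, and never used in training.
    n_test = round(n_obs * test_fraction)
    test = indices[:n_test]
    pool = indices[n_test:]

    # Each network sees the same training pool but its own random
    # training/verification allocation (random allocation is an assumption).
    n_verify = round(len(pool) * verify_fraction)
    splits = []
    for _ in range(k):
        shuffled = pool[:]
        rng.shuffle(shuffled)
        splits.append((shuffled[n_verify:], shuffled[:n_verify]))
    return test, splits
```

Every network's training and verification sets together cover the same pool, while the test indices stay disjoint from all of them.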
Combining results from individual models into a single prediction is well documented. Averaging and some form of voting are two of the more popular ways to combine outputs from ensemble members, and we have implemented both.
Responses for structures classified as “negative” and “positive” are converted internally to 0 and 1, respectively, and each of the k networks in the ensemble produces a continuous value between 0 and 1 as output. The individual outputs obtained can be combined to generate an ensemble output of “positive” or “negative” in several ways. One way, the “voting method”, is to identify a threshold t_i for each network output x_i. Each network output falling below or above the corresponding threshold then contributes a vote of 0 (x_i < t_i) or 1 (x_i > t_i) to the ensemble tally. The totaled votes from all networks in the ensemble are compared to an ensemble voting threshold (k/2 by default): a tally falling below the threshold results in a “negative” ensemble prediction and one at or above the threshold results in a “positive” ensemble prediction.
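A minimal sketch of the voting rule just described is given below. The function name `ensemble_vote` is hypothetical, and treating an output exactly equal to its threshold as a vote of 0 is an assumption, since the text does not specify that case.

```python
from typing import Optional, Sequence, Tuple


def ensemble_vote(
    outputs: Sequence[float],
    thresholds: Sequence[float],
    vote_threshold: Optional[float] = None,
) -> Tuple[int, str]:
    """Combine k network outputs into a tally and an ensemble class label."""
    if len(outputs) != len(thresholds):
        raise ValueError("need one threshold per network output")
    k = len(outputs)
    if vote_threshold is None:
        vote_threshold = k / 2  # default ensemble voting threshold

    # Each network votes 1 if its output exceeds its own threshold, else 0.
    # Assumption: an output equal to its threshold counts as a vote of 0.
    tally = sum(1 for x, t in zip(outputs, thresholds) if x > t)

    # A tally at or above the voting threshold gives a "positive" prediction.
    label = "positive" if tally >= vote_threshold else "negative"
    return tally, label


# Four networks with a common threshold of 0.5: two votes out of four
# meets the default threshold of k/2 = 2.
tally, label = ensemble_vote([0.9, 0.2, 0.7, 0.4], [0.5] * 4)
```

The tally is returned along with the label because the degree of concordance, not just the final class, is what the uncertainty estimate is built on.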
The vote tallies might be expected to follow a binomial distribution for k trials if all individual models were equally valid and the probability of correct prediction was the same for all structures. Neither condition generally holds in practice, however. There is often complete agreement one way or the other about how a given structure should be classified, and that consensus is usually correct. Misclassifications, on the other hand, do follow a roughly binomial distribution, albeit one with degrees of freedom less than k – usually substantially less.
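For reference, the idealized binomial expectation mentioned above can be computed directly. In this sketch, `k = 5` and `p = 0.8` are illustrative values only, and `p` is taken to be the per-network probability of a vote of 1 under the two idealizing conditions.

```python
import math


def binomial_pmf(votes: int, k: int, p: float) -> float:
    """Probability of exactly `votes` votes of 1 from k independent networks."""
    return math.comb(k, votes) * p**votes * (1 - p) ** (k - votes)


# Idealized tally distribution for a hypothetical 5-network ensemble.
k, p = 5, 0.8
expected = [binomial_pmf(v, k, p) for v in range(k + 1)]
```

Comparing such an idealized distribution with observed tallies makes the departure visible: unanimous tallies (0 or k) occur far more often in practice than a single shared `p` would predict.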
We found that fitting a beta-binomial distribution to the distribution of prediction errors from the training pool provides a useful estimate of the prediction uncertainty in classifying “new” external structures, and that the technique can be extended to ensembles in which the individual network outputs are averaged directly, rather than first being converted to votes.
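The text does not say how the beta-binomial fit was carried out or how the fitted distribution is converted into a per-structure uncertainty. As one possible illustration, the sketch below fits the two shape parameters by the method of moments (a simple stand-in, not necessarily the fitting procedure used here) to a hypothetical list of vote tallies for misclassified training-pool structures. The function names, the number of trials `n`, and the data are all assumptions.

```python
import math
from typing import Sequence, Tuple


def beta_binomial_pmf(x: int, n: int, a: float, b: float) -> float:
    """P(X = x) for a beta-binomial with n trials and shape parameters a, b."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    log_beta_num = math.lgamma(x + a) + math.lgamma(n - x + b) - math.lgamma(n + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp(log_choose + log_beta_num - log_beta_den)


def fit_beta_binomial_moments(tallies: Sequence[int], n: int) -> Tuple[float, float]:
    """Method-of-moments estimates of (a, b) from observed tallies in 0..n."""
    m1 = sum(tallies) / len(tallies)                 # first raw moment
    m2 = sum(t * t for t in tallies) / len(tallies)  # second raw moment
    if m1 <= 0:
        raise ValueError("mean tally must be positive")
    denom = n * (m2 / m1 - m1 - 1) + m1
    if denom <= 0:
        # Data are no more dispersed than a plain binomial; no valid fit.
        raise ValueError("moment estimates undefined for these tallies")
    a = (n * m1 - m2) / denom
    b = (n - m1) * (n - m2 / m1) / denom
    if a <= 0 or b <= 0:
        raise ValueError("moment estimates are not positive")
    return a, b


# Hypothetical tallies for misclassified structures, with n = 4 trials.
error_tallies = [0, 1, 2, 3, 4]
a, b = fit_beta_binomial_moments(error_tallies, n=4)
fitted = [beta_binomial_pmf(x, 4, a, b) for x in range(5)]
```

A maximum-likelihood fit would be a natural alternative where the moment estimates are undefined or unstable. Note also that the text suggests the effective number of trials for misclassifications is smaller than k, so `n` need not equal the ensemble size.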
[1] NL Kruhlak et al., Clin Pharm Therapeutics 2011, 91, 529-534.
[2] B Beck, A Breindl and T Clark, J Chem Inf Comput Sci 2000, 40, 1046-1051.
[3] RD Clark, J Cheminfo 2009, 1, 11.