Poster 25: Interpretation of Statistical Machine Learning Models for Ames Mutagenicity

Samuel Webb [1, 2], Thierry Hanser [1], Brendan Howlin [2], Paul Krause [2], Jonathan Vessey [1]
[1] Lhasa Limited, 22-23 Blenheim Terrace, Woodhouse Lane, Leeds, LS2 9HD
[2] University of Surrey, Guildford, Surrey, GU2 7XH

(Q)SAR methodologies too often focus on improving the predictive performance for an endpoint (measured by accuracy, sensitivity and specificity) rather than on, and potentially at the expense of, the interpretation of the results. Highly accurate predictions should be a goal of any (Q)SAR modeller; however, since the published performance of models for Ames mutagenicity is converging towards a similar accuracy (c. 85%), which falls in line with the reproducibility of the underlying data, effort may be better spent on improving the usability of these predictions.

A wide variety of descriptors, learning algorithms and combinations of the two have been investigated and published by the scientific community. The majority of these models provide a confidence value (at various levels of granularity) in addition to a class prediction. Very few indicate why the predicted class was chosen. Our aim has been to maintain the high performance that classification models can achieve while adding a new level of depth to the prediction: an explanation of why the class (active or inactive) was predicted. An internally curated version of the Hansen benchmark mutagenicity dataset has been used to build models using a variation of the Random Forest algorithm and highly interpretable fragment-based descriptors.

The modelling approach is a stepwise procedure, most of which has been implemented in the KNIME workflow package using a number of new and standard nodes. The first step is to curate the training data before generating the fragment dictionary. For these models the top 1,000 fragments were identified and selected to become the descriptor set. The presence or absence of each fragment in a structure is used to generate a binary fingerprint, in which a matched fragment is represented by an active bit at the appropriate position.
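
The fingerprint step can be sketched as follows. This is a minimal illustration only, assuming the fragment dictionary is already available as an ordered list and that fragment matching has been done elsewhere; the function names and fragment labels are hypothetical and are not taken from the KNIME workflow.

```python
from typing import List, Sequence, Set


def binary_fingerprint(present: Set[str], dictionary: Sequence[str]) -> List[int]:
    """Return one bit per dictionary fragment: 1 if matched in the structure, else 0."""
    return [1 if fragment in present else 0 for fragment in dictionary]


def active_bits(fingerprint: Sequence[int]) -> List[int]:
    """Return the positions of the bits that are set."""
    return [position for position, bit in enumerate(fingerprint) if bit]


# Hypothetical four-fragment dictionary (the abstract uses 1,000 fragments).
dictionary = ["aromatic_nitro", "aromatic_amine", "carboxylic_acid", "ether"]

# Assumption: substructure matching has already found these fragments.
matched = {"aromatic_nitro", "carboxylic_acid"}

fingerprint = binary_fingerprint(matched, dictionary)
print(fingerprint)               # [1, 0, 1, 0]
print(active_bits(fingerprint))  # [0, 2]
```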

The interpretation methodology uses the binary fingerprint representation of the structure, generated from the presence and absence of the fragments in the descriptor set. The active bits of the fingerprint are enumerated to produce combinations. For example, the fingerprint {0, 4, 6}, where fragments 0, 4 and 6 are present in the structure, would also produce the combinations {0}, {4}, {6}, {0, 4}, {0, 6} and {4, 6} when enumerated exhaustively. A prediction is made and recorded for each combination. This information is then used to build up relationships between the fragments and their impact on the prediction.
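
The exhaustive enumeration described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `predict` is a hypothetical stand-in for the trained model, and `toy_predict` is an invented rule used only to make the example runnable.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable, List

Combo = FrozenSet[int]


def enumerate_combinations(bits: Iterable[int]) -> List[Combo]:
    """All non-empty subsets of the active bits, smallest first."""
    ordered = sorted(bits)
    return [
        frozenset(subset)
        for size in range(1, len(ordered) + 1)
        for subset in combinations(ordered, size)
    ]


def predict_all(bits: Iterable[int], predict: Callable[[Combo], bool]) -> Dict[Combo, bool]:
    """Record one prediction (True = active) per enumerated combination."""
    return {combo: predict(combo) for combo in enumerate_combinations(bits)}


def toy_predict(combo: Combo) -> bool:
    """Invented rule for illustration: anything containing bit 0 is active."""
    return 0 in combo


for combo, is_active in predict_all([0, 4, 6], toy_predict).items():
    print(sorted(combo), "active" if is_active else "inactive")
```

A fingerprint with n active bits yields 2^n - 1 non-empty combinations, so exhaustive enumeration is only practical when the number of matched fragments per structure is small.
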
This interpretation method allows for the identification of single fragments that cause activation, fragment combinations that result in activation, and deactivations of either of these types. Knowledge of deactivations is not limited to structures where the prediction on the full fingerprint is inactive: it is possible to identify a localised deactivation in addition to a separate activation on the same structure. For example, a structure represented by the fingerprint {0, 1, 2, 3, 4, 5}, where 0 is an aromatic nitro fragment, 1 is an aromatic amine fragment, 2 is a carboxylic acid fragment and the remaining fragments are inconsequential, may result in the following output: {0} active combination and {1} deactivated by {2}. The bit positions can then be matched to a fragment to allow structure highlighting, or even converted to a readable text output using SMILES.
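
One possible reading of this step is sketched below. The abstract does not specify the rules, so the definitions here are assumptions: an activation is taken to be an active combination with no active proper subset, and a deactivation is a smallest set of extra bits that turns such a combination inactive. `toy_model` is an invented predictor chosen so that bit 2 plays the deactivating role, as in the output "{1} deactivated by {2}".

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Tuple

Combo = FrozenSet[int]


def all_combinations(bits: Iterable[int]) -> List[Combo]:
    """All non-empty subsets of the active bits, smallest first."""
    ordered = sorted(bits)
    return [
        frozenset(subset)
        for size in range(1, len(ordered) + 1)
        for subset in combinations(ordered, size)
    ]


def interpret(
    bits: Iterable[int], predict: Callable[[Combo], bool]
) -> Tuple[List[Combo], List[Tuple[Combo, Combo]]]:
    """Return (activations, deactivations) under the assumptions stated above."""
    preds = {combo: predict(combo) for combo in all_combinations(bits)}

    # Activation: predicted active, and no proper subset is already active.
    activations = [
        combo for combo, active in preds.items()
        if active and not any(preds[sub] for sub in preds if sub < combo)
    ]

    # Deactivation: smallest extra bit sets that make an activation inactive.
    deactivations = []
    for cause in activations:
        extras = [c - cause for c, active in preds.items() if not active and cause < c]
        minimal = [e for e in extras if not any(other < e for other in extras)]
        deactivations.extend((cause, e) for e in minimal)
    return activations, deactivations


def toy_model(combo: Combo) -> bool:
    """Invented rule: bit 0 always activates; bit 1 activates unless bit 2 is present."""
    return 0 in combo or (1 in combo and 2 not in combo)


activations, deactivations = interpret(range(6), toy_model)
print([sorted(a) for a in activations])                     # [[0], [1]]
print([(sorted(a), sorted(d)) for a, d in deactivations])   # [([1], [2])]
```

In this toy case the full fingerprint is still predicted active because of bit 0, yet the localised deactivation of bit 1 by bit 2 is recovered from the subset predictions.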

This new interpretation algorithm is independent of the statistical learning method. Model performance is on par with both commercial and freely available models for this endpoint. Where these models excel is in their ability to provide a detailed explanation of the cause of a prediction. The only prerequisite for use is that the data can be represented by a binary fingerprint.
