Varkony Revisited: Subgraph Enumeration and the Multiple MCS ProblemAndrew Dalke1
|1Andrew Dalke Scientific AB|
|In this presentation I show how subgraph enumeration, commonly used in molecular substructure mining, can be used to solve the multi-MCS problem. The basic approach was first described in the paper by Varkony, Shiloach and Smith (JCICS 1979) but has mostly fallen into disfavor in the MCS literature. It was independently discovered by the substructure mining field where it is most often used for set discrimination and clustering. Neither approach makes use of the modern understanding that chemical substructure searches are fast, nor build on widely-used, tested, and optimized cheminformatics toolkits.|
I have implemented a new package called fmcs. It is written in Python using the RDKit Python/C++ cheminformatics toolkit. It enumerates subgraphs of chosen reference structures to generate SMARTS patterns that are tested against the other structures in the data set. If enough of the SMARTS tests against the other structures pass, then the structure is a sufficiently common substructure, and it is allowed to grow into its superstructures, where the process repeats.
This approach, combined with the heuristics common to other MCS methods, has very good performance. I have developed a set of MCS performance benchmarks and found that the fmcs performance is roughly the same as other MCS packages written purely in Java or C, which means the algorithm is quite competitive given that much of the time is spent in Python code.
The usual clique-based or backtracking solutions to the multi-MCS problem is fragile. A single misclassified structure out of a set of 100 can cause a drastic change to the output MCS, and there's no way to detect which structure caused the problem. The subgraph enumeration approach used in fmcs and mining tools like MoSS handles this naturally, because it is easy to find the largest substructure which is common to some threshold fraction of the compounds, then use the result to identify which structures did not match.
To test if this is useful, I have clustered the nodes in the ChEBI structure ontology, which uses curator-based assignments, and developed a web-based tool to show the changes in MCS membership across different thresholds. This has helped identify several different classes of mis-classification in the data, including issues with aromaticity assignment and nomenclature interpretation.
The fmcs approach, while similar in spirit to Varkony et al. and to substructure mining tools, actually takes a different direction. The latter two methods depend on fragment canonicalization in order to reduce the number of ways to map, say, a benzene to a benzene. This algorithm design choice, while appropriate for general graphs where subgraph isomorphism search is NP-hard, is not as relevant for small molecules, where substructure matches are generally quite fast. In addition, the MCS problem really only needs to know if a given structure exists at least once, and doesn't need to find all 12 possible correspondences between benzene and benzene. In fact, fmcs does not require a canonicalization step, though in practice a semi-canonical fragment SMARTS improves the overall performance. It depends instead on the existing RDKit substructure match machinery.
The connection between substructure mining and the MCS problem may seem surprising, but it shouldn't be regarded as an oddity. I believe that substructure enumeration, especially when combined with canonical fragment SMARTS generation, has a few other surprises in store for us. I will end with a couple of examples of what those might be.