John Mayfield Abstract

The secrets of fast SMARTS matching

John Mayfield1, Roger Sayle1

1NextMove Software Ltd
Whether you are searching a database, filtering sets of compounds, normalising for registration, calculating descriptors, or involved in fragment-based drug discovery you will need to perform some form of SMARTS or SMARTS-like match (i.e. subgraph isomorphism). Considered a “fundamental pilar” in cheminformatics, almost all toolkits have a SMARTS utility to fulfil these needs but their efficiency is often far from ideal. With publicly available databases growing in size up to and above a billion records [1,2] the need for ever more efficient algorithms is more prevalent than ever.

In database searching inefficiencies in the SMARTS matcher have previously been tackled with a fingerprint pre-screen. Although this is an effective strategy for distinctive queries it does not help with the worst case where each molecule needs to loaded and verified with an atom match. For databases on the order of one billion molecules even distinctive queries may retrieve millions of records from the pre-screen that must be loaded and verified with an atom match. Furthermore, a pre-screen does not help in other use-cases such as normalisation or filtering where multiple patterns are tested on each molecule. To address the worst case situation for large databases, our research has primarily focussed on improving the atom matching algorithms and file formats.

Previously we have presented [3,4] how atom and bond expressions can be optimised and compiled to reduce computational work and how the ordering of these expressions affects performance. The sequence in which these expressions are traversed and matched is critical to the efficiency of a match yet is often an arbitrary storage order. We will demonstrate issues with the popular VF2 algorithm [5,6] in how it selects atoms and spends unnecessary time computing constraints with little benefit. We will show how a breath- or depth-first matcher[7] selects candidate atoms faster than VF2 and introduce a novel best-first algorithm. For improved efficiency the algorithm can be defined as a set of custom op-codes that are interpreted on a virtual machine. These op-codes simplify extension and handling of complex query features including component grouping, stereochemistry, and recursive patterns.

For searching large databases, improvements to the algorithms alone can only get you so far. When reading from common formats like SMILES/MOLfile the total time spent in the SMARTS match is dwarfed by the time spent allocating, parsing, and preparing (e.g. ring perception and aromaticity) the molecule. A common approach is to serialise the molecule data structures to a binary format but these still require parsing and memory allocation. Our starting point was to work backwards and ask “What would a format optimally designed for SMARTS matching look like?”. To answer this we co-designed the binary format (ATDB) that uses an atom type table to compactly store molecules using one byte per atom, requires no parsing or memory allocation, and allows atom and bond expressions to be matched in constant time.

[1] Enamine REAL, currently 720M compounds (https://enamine.net/).
[2] ZINC15, currently 1.367B compounds (http://zinc15.docking.org).
[3] Roger Sayle. 1st-class SMARTS patterns. Daylight EMUG. 1997. (http://www.daylight.com/meetings/emug97/Sayle/)
[4] Roger Sayle. Efficient matching of multiple chemical subgraphs. 9th ICCS. June 2011. (https://www.nextmovesoftware.com/talks/Sayle_MultipleSmarts_ICCS_201106.pdf)
[5] LP Cordella, P Foggia, C Sansone, M Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2004. 26 (10). 1367-1372.
[6] HC Ehrlich and M Rarey. Systematic benchmark of substructure search in molecular graphs – From Ullmann to VF2. J Cheminform. 2012. 4 (13).
[7] LC Ray and RA Kirsch. Finding Chemical Records by Digital Computers. Science. 1957. 126.