FSM’s of Chemical Nomenclature: Minimising State Machines from Chemical Grammars
Michael Blakey1,2 and John Mayfield1
1NextMove Software
2University of Southampton
Finite State Machines (FSM) are an efficient way of representing computational grammars, used for both text mining and spelling correction. IUPAC Chemical Nomenclature is an example of an infinite grammar, and as such, cannot reasonably be represented by standard FSM structures. NextMove’s LeadMine and OPSIN use a two level non-deterministic structure to represent the IUPAC rules, and whilst providing large coverage, the NFA requires complex algorithms to match and spelling correct, naturally leading to slower performance. In theory it is possible to reduce NFA’s to DFA’s using subset and power-set construction, however for chemical nomenclature this algorithm is exponential and not feasible on even high end workstations. We look to train these NFA structures on 20 years worth of patent data to evaluate the rule scope of IUPAC, with aims to reduce machine complexity enough to perform the power-set construction.
NextMoves LeadMine is a text-mining/spelling correction software “suite” for chemical entities. Current performance allows LeadMine to extract and spelling correct 20 years of patents in less than a day, however NFA matching is around 25 times slower than an equivalent DFA algorithm, making a high coverage DFA representation eminently desirable. The performance gain here also directly affects spelling correction speed.
Previously, the grammar needed to represent IUPAC nomenclature has been deemed too complex to try and minimise. This is due to IUPAC rules allowing unlimited combinations of chemical terms within the rule set. Whilst these rules may be allowed, we can analyse large pools of data to see whether they were used in patent histories.
We present a CaffeineWithdrawal, a tool based on NextMove Software’s CaffeineFix. Withdrawal provides basis for training and minimising state machines based on input data, either by state or edge transitions taken. Untaken States or Edges can then be removed in order to reduce the machine size for attempting a NFA to DFA iterative subset/power-set construction. Whilst built on the CaffeineFix FSM structure, these methods apply to any FSM state API. We focus on three grammars used to build the IUPAC rules, and conclude with successful DFA conversion of the smallest.