Sketchy Sketches II: Advances in Extracting Reaction Information from Patents
John Mayfield, Ingvar Lagerstedt and Roger Sayle
Key information in chemical documents is often communicated only as chemical structure diagrams. In born-digital documents such as patents, journal articles and electronic lab notebooks (ELNs), these are preferably captured by embedding the sketch files (PerkinElmer ChemDraw/ACD ChemSketch/BIOVIA Draw SKC files). Since 2001 the United States Patent and Trademark Office (USPTO) has redrawn and included 44 million ChemDraw files with chemical patent publications. These provide a rich source of chemical data if they can be extracted and processed correctly.
Unfortunately, the flexibility required of sketch formats means they are often used for general-purpose vector drawing, and naïvely exporting the chemical data is error-prone. We have previously presented the fundamental techniques to interpret and clean up sketch files for the extraction of molecules, reactions, Markush cores, and R-group definitions. These tools have been successfully deployed in Pistachio for the extraction of more than 15 million reactions from patents, with more than 3.6 million from USPTO sketches.
Here we present some recent insights and techniques used to capture more reactions and associated information, including failed reactions, compound/step labels, and reaction schemes with multiple synthesis routes.
 John May, Daniel Lowe, Roger Sayle. Sketchy Sketches: Hiding Chemistry in Plain Sight. 7th Sheffield Conference on Chemoinformatics. Jul 2016. https://nextmovesoftware.com/talks.html
 John Mayfield, Ingvar Lagerstedt and Roger Sayle. Pistachio. NIH Virtual Workshop on Reaction Informatics. May 2021. https://nextmovesoftware.com/talks.html