Resolving Code Names to Structures from the Medicinal Chemistry Literature: Not as FAIR as it Should Be
Roxana-Maria Rujan, Miguel Amaral, Christopher Southan and Ian Dunlop
Medicines Discovery Catapult, Block 34, Mereside, Alderley Park, Macclesfield, Cheshire, SK10 4ZF
The practice of assigning code names (CNs) as the publicly declared identifiers for distinct lead compounds in drug discovery is widespread but remains problematic for biocuration. CNs are typically used on company web sites and in press releases, abstracts, posters, slides, clinical trial entries and journal articles. The most common approximate form is “XXX-123456”, with a letter prefix for the organization of origin and numbering from an implicit internal registration system. However, CNs are effectively non-standardized: they may include single-letter codes, spaces, commas, suffixes and multiple hyphens, and some are too short to have any useful search specificity. It can also be challenging to resolve and extract the name-to-structure (n2s) mapping from the journal article, especially for image-only representations. Further challenges arise when CNs are blinded in press releases and clinical trial entries (i.e., there is no open n2s). This work had an initial focus on detecting and curating CNs from the Journal of Medicinal Chemistry. From ~2000 PubMed abstracts, ~300 codes were identified that could be manually mapped to structures. We also developed an extended regular expression syntax to identify as many CNs as possible automatically from the abstract text alone. However, extensive specificity tweaking was needed, including the compilation of false-positive blacklists corresponding in many cases to gene and cell line names in the abstracts. While many CNs had n2s matches in PubChem from various submitting sources such as Guide to Pharmacology, BindingDB and ChEMBL, others were novel. In addition, many lead structures remained difficult to map into databases because of trivial non-coded naming (e.g., compound 22b). Causes and amelioration of these curation and FAIRness issues for medicinal chemistry lead compounds will be outlined.
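The regular-expression-plus-blacklist approach described above can be illustrated with a minimal Python sketch. This is not the extended syntax developed in this work, which is not reproduced here. The pattern, the blacklist entries and the function name find_code_names are all hypothetical, chosen only to show how the common “XXX-123456” form might be matched and how cell line or gene names of the same shape might be filtered out.

```python
import re

# Hypothetical pattern for the common "XXX-123456" form (an assumption,
# not the authors' expression): 2-5 uppercase letters, an optional hyphen
# or space, 3-7 digits, and an optional single-letter suffix.
CN_PATTERN = re.compile(r"\b[A-Z]{2,5}[- ]?\d{3,7}[A-Za-z]?\b")

# Illustrative false-positive blacklist: cell line and gene/marker names
# that have the same shape as a code name.
BLACKLIST = {"HEK293", "HT-1080", "CD133"}


def find_code_names(text):
    """Return candidate code names in order of first appearance."""
    found = []
    for match in CN_PATTERN.finditer(text):
        candidate = match.group(0)
        if candidate in BLACKLIST:
            continue  # known false positive, skip it
        if candidate not in found:
            found.append(candidate)
    return found


abstract = (
    "AZD-1234 inhibited growth of HEK293 cells, "
    "whereas compound 22b and GSK 567890 did not."
)
print(find_code_names(abstract))
```

In this made-up sentence the sketch keeps the two coded names and drops the cell line, while "compound 22b" is never a candidate because it carries no organizational prefix. That mirrors the mapping difficulty noted above for trivial non-coded names. A pattern this simple would also miss single-letter codes, commas and multi-hyphen forms, which is where the specificity tweaking comes in.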