Estimating Error Rates in Bioactivity Databases
Pekka Tiikkainen1, Lutz Franke1
1Merz Pharmaceuticals GmbH
Bioactivity databases are routinely used in drug discovery to look up the activities of small molecules and, via prediction tools, to predict them. These databases are typically curated manually from patents and scientific articles. In addition to errors already present in the source documents, human error during extraction can introduce further mistakes. Such errors can lead to wrong decisions in early drug discovery. In our earlier work [1], we cross-analyzed databases to identify potentially erroneous data points, but could not give database-specific error rate estimates. In the current work, we analyzed data shared by three large bioactivity databases (ChEMBL [2], Evolvus [3] and WOMBAT [4]) to provide more detailed error rate estimates.
We started by identifying articles cited by all three databases. For each such article, bioactivity data extracted by the different databases were compared. Taking one data variable at a time (pivot variable), activities with identical values in the remaining variables were grouped together. For each group, the pivot values were compared and the database with a discrepant pivot value was assumed to be incorrect. Iterating over the different activity variables, we could calculate both database- and variable-specific error rate estimates.
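The pivot-variable comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The field names, the database labels "A", "B" and "C", and the toy records are hypothetical. The sketch also makes assumptions the abstract does not state: it only scores groups that contain exactly one record from each of three databases, it treats the value shared by two databases as correct, and it skips groups in which all three values differ.

```python
from collections import Counter, defaultdict

# Hypothetical field names for the four activity variables named in the abstract.
VARIABLES = ("structure", "target", "activity_type", "activity_value")


def estimate_error_rates(records, variables=VARIABLES):
    """Return {(database, pivot): (discrepant, compared)}.

    Assumes exactly three databases and a 2-vs-1 majority rule.
    """
    databases = sorted({r["database"] for r in records})
    counts = defaultdict(lambda: [0, 0])  # [discrepant, compared]

    for pivot in variables:
        others = [v for v in variables if v != pivot]

        # Group records from the same article that agree on all non-pivot variables.
        groups = defaultdict(list)
        for r in records:
            key = (r["article"],) + tuple(r[v] for v in others)
            groups[key].append(r)

        for group in groups.values():
            by_db = {r["database"]: r[pivot] for r in group}
            # Assumption: score only groups with one record per database.
            if len(group) != len(databases) or len(by_db) != len(databases):
                continue
            majority_value, majority_size = Counter(by_db.values()).most_common(1)[0]
            if majority_size < 2:
                continue  # all values differ, so no database can be singled out
            for db, value in by_db.items():
                counts[(db, pivot)][1] += 1
                if value != majority_value:
                    counts[(db, pivot)][0] += 1

    return {key: tuple(value) for key, value in counts.items()}


# Toy data (made up): one article, two data points, three databases.
records = []
for db in ("A", "B", "C"):
    records.append({"database": db, "article": "art1", "structure": "CCN",
                    "target": "D2", "activity_type": "IC50", "activity_value": 5.0})
    records.append({"database": db, "article": "art1", "structure": "CCO",
                    "target": "5-HT2A", "activity_type": "Ki",
                    # Database "B" carries a discrepant activity value.
                    "activity_value": 21.0 if db == "B" else 12.0})

rates = estimate_error_rates(records)
for (db, pivot), (bad, total) in sorted(rates.items()):
    print(f"{db} {pivot}: {bad}/{total}")
```

In this sketch a record with an error in one variable cannot be grouped when any other variable is the pivot, so it drops out of those comparisons. How the original analysis treated such cases is not stated in the abstract.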
While the databases had roughly equal error rate estimates, the estimated error rates differed widely between activity variables. Molecular structures had the highest error frequency, followed by the molecular target, the activity value and the activity type.
[1] Tiikkainen P and Franke L. Analysis of Commercial and Public Bioactivity Databases. J. Chem. Inf. Model., 2012, 52 (2), pp 319–326
[2] ChEMBL database. European Bioinformatics Institute. https://www.ebi.ac.uk/chembl/
[3] Evolvus bioactivity database. Evolvus. http://www.evolvus.com/di.htm
[4] WOMBAT database. Sunset Molecular Discovery Ltd. http://www.sunsetmolecular.com/