FNGRPRNTS: Processing Just the Bits you Need, and None of the 1s you Don’t.
Roger A. Sayle and John W. Mayfield
NextMove Software, Cambridge, UK
In the 1980s, Daylight Chemical Information Systems revolutionized cheminformatics by demonstrating that the performance of chemical similarity searches could be vastly enhanced by storing binary fingerprints in RAM. Indeed, one of the co-founders of Daylight was previously a reseller of memory upgrades for VAX computers. This “in-memory” approach has served well for four decades, during which memory sizes have grown as fast as (or faster than) chemical databases. However, the recent interest in ultra-large virtual databases, containing many billions of molecules, and the increasing use of cloud computing require a reimagining of traditional cheminformatics search methods.
In this talk, we describe approaches to tackle the challenge of high (storage) latency, when binary fingerprints can no longer fit in the memory of today’s servers. Techniques, applicable to Tanimoto, Tversky and Manhattan distance searches, include spreading the workload across multiple servers, reorganizing fingerprints such that only the subset of bits required by a query is read from disk, and (when possible) taking advantage of the similarities between consecutive queries. Unlike previous methods that attempt to achieve sublinear search by pruning the search when identifying near neighbors, these new approaches reduce the amount of work required to determine the distance to each molecule in a database. As an extreme example, if the fingerprint for benzene contains only six “on” bits, then only six bits of data for each compound need be fetched from disk during a Tanimoto similarity search against benzene.
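The benzene example can be illustrated with a minimal sketch, assuming a column-oriented ("bit-sliced") layout in which each fingerprint bit position is stored as its own column across all molecules. The abstract does not specify the actual on-disk format, so the layout, the function names `build_bit_columns` and `tanimoto_from_columns`, and the in-memory Python integers standing in for disk-resident columns are all hypothetical. The sketch also assumes a precomputed per-molecule bit count, which the Tanimoto denominator needs; whether the authors store it this way is not stated.

```python
from typing import List, Sequence, Set, Tuple


def build_bit_columns(
    fingerprints: Sequence[Set[int]], nbits: int
) -> Tuple[List[int], List[int]]:
    """Transpose row-wise fingerprints into one column per bit position.

    columns[b] is an integer whose bit i is set when molecule i has
    fingerprint bit b switched on. popcounts[i] is the number of "on"
    bits of molecule i (hypothetical side table, assumed precomputed).
    """
    columns = [0] * nbits
    popcounts = []
    for mol_index, on_bits in enumerate(fingerprints):
        popcounts.append(len(on_bits))
        for bit in on_bits:
            columns[bit] |= 1 << mol_index
    return columns, popcounts


def tanimoto_from_columns(
    query_bits: Set[int], columns: List[int], popcounts: List[int]
) -> List[float]:
    """Score every molecule while touching only the query's "on" columns."""
    common = [0] * len(popcounts)
    for bit in query_bits:
        column = columns[bit]  # the only per-bit data "read" for this query
        while column:
            lowest = column & -column  # isolate the lowest set bit
            common[lowest.bit_length() - 1] += 1
            column ^= lowest  # clear that bit and continue
    query_count = len(query_bits)
    scores = []
    for shared, mol_count in zip(common, popcounts):
        union = query_count + mol_count - shared
        scores.append(shared / union if union else 0.0)
    return scores


# Toy database of three molecules, given as sets of "on" bit indices.
database = [{0, 1, 2}, {1, 2, 3, 4}, {5}]
columns, popcounts = build_bit_columns(database, nbits=8)
scores = tanimoto_from_columns({1, 2}, columns, popcounts)
```

Here the query has two "on" bits, so only two of the eight columns are consulted, regardless of how many bits each database fingerprint has. In a real system each column would presumably be a contiguous region on disk, so a sparse query translates into a small number of sequential reads.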