Poster 33: The FPS Format and the Chemfp Package
Andrew Dalke (1)
(1) Andrew Dalke Scientific AB
Fingerprint records are perhaps the simplest data set in cheminformatics. They contain only an identifier and a set of bits. Yet underneath that simplicity are many details. Can the identifier be an IUPAC compound name containing a space? What about a tab, or newline? Are the bits represented as '0' and '1' characters? Hex-encoded? Base64? Are they written as little-endian, big-endian, or in mixed-mode? What type of fingerprint is it? What happens if the fingerprint length isn't a multiple of the word size? And this is completely ignoring questions about sparse fingerprints or count fingerprints!
This poster will attempt to convince you to use the FPS format instead of making up your own variation.
It's a very simple format. There are some optional header lines which start with a "#", followed by the fingerprints. The header contains key/value pairs like "type=RDKit-Fingerprint/1 minPath=1 maxPath=7" or "num_bits=166". The fingerprint data contains tab-separated columns. The first column is always the hex-encoded fingerprint and the second is always the identifier.
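The layout described above can be read with a few lines of plain Python. The following is only a minimal sketch, not the chemfp reader: the function name `parse_fps`, the sample identifiers, and the 16-bit sample fingerprints are made up for illustration, and the FPS specification covers details that this sketch ignores.

```python
import io

def parse_fps(lines):
    """Parse FPS-style text into (header dict, list of (identifier, bytes)).

    Hypothetical helper for illustration only; see the FPS specification
    for the real rules.
    """
    header = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header line of the form "#key=value". Header lines without
            # an "=" are skipped in this sketch.
            key, sep, value = line[1:].partition("=")
            if sep:
                header[key] = value
            continue
        # Data line: tab-separated columns. The first column is the
        # hex-encoded fingerprint and the second is the identifier.
        fields = line.split("\t")
        hex_fp, identifier = fields[0], fields[1]
        records.append((identifier, bytes.fromhex(hex_fp)))
    return header, records

# Made-up sample data: two 16-bit fingerprints.
sample = (
    "#num_bits=16\n"
    "#type=RDKit-Fingerprint/1 minPath=1 maxPath=7\n"
    "00ff\tcompound-1\n"
    "a1b2\tcompound-2\n"
)
header, records = parse_fps(io.StringIO(sample))
```

Splitting on the first "=" keeps a value such as "RDKit-Fingerprint/1 minPath=1 maxPath=7" intact, and splitting data lines only on tabs means an identifier may contain spaces.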
You can read the FPS specification for all of the details, but I think you have the idea. The FPS format is, to my knowledge, the only well-documented exchange format for fingerprint data sets. In fact, I'm hard pressed to come up with any other fingerprint format used by more than the originating group. Instead, there's a lot of special-purpose code to handle each different format.
Why should you use the FPS format? Well, for one, that means you don't need to figure out the format details yourself. The big advantages come when you start using multiple packages which all support the same format. You can export data from RDKit, analyze it with R, import it into CACTVS, or use it in Knime, all without writing any converters.
To make it even more enticing, I've developed the chemfp package. It contains a set of command-line tools to generate FPS fingerprint files using Open Babel, RDKit, or OEChem, another tool to extract fingerprints from SD file tags, and another one to do searches.
And the search is fast! If you have a modern CPU, it will use the hardware POPCNT instruction. Even on older computers, it has a few tricks for getting good performance out of the hardware. On top of that, it uses a sub-linear search, memory-aligned data, pre-computed popcounts, and more. Ask me for a demo, and I'll show you how I can find the k=100 nearest neighbors in ChEMBL with subsecond performance on my laptop. It's fast enough to get the results as I input the structure. Even a hard problem, like finding the self-similarity matrix of 2 million fingerprints with a Tanimoto threshold of 0.8, takes only a few hours on a 15 processor machine.
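To give a feel for why pre-computed popcounts help, here is a small pure-Python sketch of a Tanimoto threshold search. It is not chemfp's implementation, and the names `popcount`, `tanimoto`, and `threshold_search` are made up for this example. It relies on a general property of the Tanimoto score: for fingerprints with a and b bits set, the score can never exceed min(a, b) / max(a, b), so a target whose popcount is too far from the query's can be skipped without comparing any bits.

```python
def popcount(fp: bytes) -> int:
    """Number of bits set in a byte-string fingerprint."""
    return bin(int.from_bytes(fp, "big")).count("1")

def tanimoto(a: bytes, b: bytes) -> float:
    """Tanimoto similarity of two equal-length fingerprints."""
    both = popcount(bytes(x & y for x, y in zip(a, b)))
    either = popcount(a) + popcount(b) - both
    return both / either if either else 0.0

def threshold_search(query, targets, threshold):
    """Return (identifier, score) pairs with score >= threshold.

    `targets` is a list of (identifier, fingerprint, popcount) tuples,
    with the popcounts computed once ahead of time.
    """
    q = popcount(query)
    hits = []
    for identifier, fp, count in targets:
        # Upper bound on the score is min/max of the two popcounts.
        # If even the bound is below the threshold, skip the target.
        if min(q, count) < threshold * max(q, count):
            continue
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((identifier, score))
    return hits

# Made-up 16-bit fingerprints for illustration.
raw = [("a", "ff00"), ("b", "fe00"), ("c", "0f0f"), ("d", "0100")]
targets = [(name, bytes.fromhex(h), popcount(bytes.fromhex(h))) for name, h in raw]
query = bytes.fromhex("ff00")
hits = threshold_search(query, targets, 0.8)
```

This version still visits every target and only avoids the bit comparison. If the targets are sorted by popcount, the same bound lets a search skip whole ranges at once, which is one way a search can become sub-linear.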
Are you more of a programmer than a command-line user? Then switch over to the chemfp library for Python. You'll get all of the above capabilities in a documented API.
Did I mention that all of this is available for no cost under the BSD license?
Of course, you can write your own GPU search code, or implement MinHash, or develop a new fingerprint method without using chemfp at all. But with the FPS format you have a way for those pieces to talk with each other and with the outside world, without having to write yet another boring fingerprint format translator.