Tuomo Kalliokoski Abstract

Efficient Structure-based Virtual Screening of Ultra-large Enumerated Chemical Spaces using macHine leArning booSTEd dockiNg (HASTEN)

Tuomo Kalliokoski1, Ainoleena Turku1, Toni Sivula2, Ina Pöhner2, Antti Poso2,3 and Heikki Käsnänen1

1Orion Pharma, Orionintie 1A, 02101 Espoo, Finland
2School of Pharmacy, University of Eastern Finland, Kuopio, Finland
3Department of Internal Medicine VIII, University Hospital Tübingen, Tübingen, Germany


Structure-based virtual screening (docking) is a standard workhorse of hit identification in drug discovery projects. Although individual docking calculations are relatively fast on current hardware, brute-force docking quickly becomes cumbersome when screening ultra-large enumerated chemical spaces such as Enamine REAL, which contains billions of compounds. For example, docking 1.56 billion compounds from the Enamine REAL Lead-Like subset to a single protein structure with Glide HTVS on 640 CPU cores took 85 days, which is clearly too long for an industrial drug discovery project timeline and too expensive in terms of computing cost (Sivula et al., manuscript in preparation).
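To put the brute-force figures in perspective, the short calculation below converts them into core-hours and an average cost per compound. It is only a back-of-the-envelope sketch: it uses the numbers quoted above and assumes that all 640 cores were fully utilized for the whole 85 days, which the abstract does not state.

```python
# Figures quoted in the abstract.
compounds = 1.56e9      # Enamine REAL Lead-Like subset
cpu_cores = 640
wall_days = 85

# Assumption: every core is busy for the entire wall-clock time.
core_hours = cpu_cores * wall_days * 24
core_seconds = core_hours * 3600
seconds_per_compound = core_seconds / compounds

print(f"{core_hours:,} core-hours in total")
print(f"{seconds_per_compound:.2f} core-seconds per compound on average")
```

Under that assumption the campaign consumed about 1.3 million core-hours, or roughly three core-seconds per compound, so the cost is driven by the size of the library rather than by any single docking calculation.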

One way to tackle this problem is to use machine learning to reduce the number of compounds that must be docked, since predicting a docking score from a SMILES string is much faster than performing the actual docking calculation. A software tool called macHine leArning booSTEd dockiNg (HASTEN) was developed to speed up docking-based virtual screening campaigns (Kalliokoski, Mol. Inform. 2021). It is freely available from https://github.com/TuomoKalliokoski/HASTEN.
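The general idea of machine-learning-boosted docking can be illustrated with a toy loop: dock a small sample, train a cheap surrogate on the resulting scores, dock only the compounds the surrogate ranks best, and repeat. The sketch below is a generic illustration and not HASTEN's implementation. The `dock` function, the two-number descriptors standing in for SMILES-derived features, the k-nearest-neighbour surrogate and all parameter values are hypothetical choices made so that the example runs with the Python standard library alone.

```python
import random

def dock(desc):
    """Hypothetical stand-in for an expensive docking run.
    Lower (more negative) scores are better."""
    x, y = desc
    return -10.0 + 8.0 * ((x - 0.3) ** 2 + (y - 0.7) ** 2)

def predict(desc, training, k=5):
    """Toy surrogate: mean score of the k nearest already-docked compounds."""
    def sq_dist(item):
        (x, y), _ = item
        return (x - desc[0]) ** 2 + (y - desc[1]) ** 2
    nearest = sorted(training, key=sq_dist)[:k]
    return sum(score for _, score in nearest) / len(nearest)

def boosted_screen(library, n_iterations=3, batch=50, seed=0):
    """Dock a random batch, then repeatedly dock the best-predicted compounds."""
    rng = random.Random(seed)
    docked = {i: dock(library[i]) for i in rng.sample(range(len(library)), batch)}
    for _ in range(n_iterations):
        training = [(library[i], s) for i, s in docked.items()]
        undocked = [i for i in range(len(library)) if i not in docked]
        # Rank the undocked compounds by predicted score, best first.
        undocked.sort(key=lambda i: predict(library[i], training))
        for i in undocked[:batch]:
            docked[i] = dock(library[i])
    return docked

# Toy library of 2000 "compounds", each described by two numbers.
rng = random.Random(1)
library = [(rng.random(), rng.random()) for _ in range(2000)]
docked = boosted_screen(library)

# Ground truth is affordable here only because the library is tiny.
true_scores = [dock(d) for d in library]
cutoff = sorted(true_scores)[19]  # score of the 20th best compound
top_true = {i for i, s in enumerate(true_scores) if s <= cutoff}
recall = len(top_true & set(docked)) / len(top_true)
print(f"docked {len(docked)} of {len(library)}; recall of top 20: {recall:.2f}")
```

In a real campaign the surrogate would be a model trained on molecular structures and the ground truth would not be available, which is why recall has to be estimated rather than measured directly.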

Brute-force docking of 1.56 billion compounds against two different academic drug discovery targets (SurA and GAK) was performed to produce ground truth data for the machine learning development. On this ultra-large data set, HASTEN retrieved 85-90% of the compounds with high docking scores by docking only 0.5-1% of the database.

HASTEN has been optimized to allow the screening of ultra-large enumerated databases in project-relevant timeframes with desktop computers. As an industrial example from Orion Pharma, the 4.1 billion compound version of Enamine REAL was screened with HASTEN against an oncology target. Only 0.2% of the 4.1 billion compounds were docked with Glide SP, which produced approximately 100,000 high-scoring compounds (estimated recall of 0.50).
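The figures from the industrial example can be related to each other with simple arithmetic. The sketch below uses only the numbers quoted above. The implied total number of high-scoring compounds and the enrichment over random selection are derived quantities, not results reported in the abstract, and they assume that the recall estimate refers to the same set of high-scoring compounds as the roughly 100,000 retrieved ones.

```python
# Figures quoted in the abstract.
library_size = 4.1e9       # compounds in the screened Enamine REAL version
fraction_docked = 0.002    # 0.2% of the library was docked
hits_found = 100_000       # approximate number of high-scoring compounds
estimated_recall = 0.50

# Derived quantities (assumptions, see the text above).
n_docked = library_size * fraction_docked            # about 8.2 million
implied_total_hits = hits_found / estimated_recall   # about 200,000
# Docking a random 0.2% would be expected to recover about 0.2% of the hits.
enrichment_over_random = estimated_recall / fraction_docked  # about 250-fold

print(f"compounds docked: {n_docked:,.0f}")
print(f"implied high-scoring compounds in the library: {implied_total_hits:,.0f}")
print(f"enrichment over random selection: {enrichment_over_random:.0f}x")
```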

While HASTEN is already a production-ready tool for virtual screening, it has several adjustable parameters that can be optimized further for even better performance. These parameters, and lessons learned from actual screening campaigns, will also be discussed in this presentation.