Improved Compound Library Enhancement with Artificial Intelligence Algorithms from Computer Chess.
Roger A. Sayle1, John W. Mayfield1, Noel M. O’Boyle1, Nicolas Zorn2
1NextMove Software Limited, Cambridge, UK
2F. Hoffmann-La Roche Ltd, Basel, Switzerland
As the number of small molecules available for purchase increases dramatically, the task of maintaining a screening collection of diverse and representative compounds grows ever more challenging. Currently (in 2019), around a billion small organic compounds are available from chemical vendors, requiring ever more efficient algorithms and hardware for diversity selection and compound profiling. In real-world applications of diversity selection within the pharmaceutical industry, the task is further complicated both by the existence of a current or previous screening collection of several million compounds (whose compounds are depleted over time and may no longer be available or optimally desirable) and by the desire to sample different regions of chemical space (such as kinase inhibitors or peptides) with different densities. For example, as much of chemical space as possible should be covered by cheap compounds, with more expensive compounds used only to fill any remaining interstitial voids.
The MaxMin algorithm [1,2] is frequently used in cheminformatics to pick diverse sets of compounds. In 2017, we (RS) developed an algorithmic improvement to traditional MaxMin selection that dramatically reduces the number of (Tanimoto) comparisons required, thereby significantly increasing the size of data sets that can be processed; the resulting implementation was contributed to the open source RDKit toolkit. A major source of improvement was the application of an AI technique called alpha-beta pruning, more commonly encountered in programs for playing chess or Go.
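The pruning idea can be illustrated with a minimal sketch. This is not the RDKit implementation; the function names, the use of Python sets as fingerprints, and the random test data are all assumptions made for illustration. Each candidate keeps an upper bound on its distance to the nearest pick, and the scan abandons a candidate as soon as that bound can no longer beat the best candidate found so far, which is the cutoff borrowed from alpha-beta search.

```python
import random

def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0

def maxmin_naive(fps, n_pick, first=0):
    """Baseline MaxMin: compare every remaining candidate with each new pick."""
    picks = [first]
    min_dist = {i: float("inf") for i in range(len(fps)) if i != first}
    comparisons = 0
    while len(picks) < n_pick and min_dist:
        newest = fps[picks[-1]]
        best, best_d = None, -1.0
        for i in min_dist:
            d = tanimoto_distance(fps[i], newest)
            comparisons += 1
            if d < min_dist[i]:
                min_dist[i] = d
            if min_dist[i] > best_d:
                best, best_d = i, min_dist[i]
        picks.append(best)
        del min_dist[best]
    return picks, comparisons

def maxmin_pruned(fps, n_pick, first=0):
    """MaxMin with lazy bounds and an early cutoff (illustrative sketch)."""
    picks = [first]
    # bound[i]: min distance from i to the first seen[i] picks. Adding picks
    # can only lower it, so it is an upper bound on the true min distance.
    bound = {i: float("inf") for i in range(len(fps)) if i != first}
    seen = {i: 0 for i in bound}
    comparisons = 0
    while len(picks) < n_pick and bound:
        best, best_d = None, -1.0
        for i in bound:
            if bound[i] <= best_d:
                continue  # stale bound already loses: no comparisons needed
            while seen[i] < len(picks):
                d = tanimoto_distance(fps[i], fps[picks[seen[i]]])
                comparisons += 1
                seen[i] += 1
                if d < bound[i]:
                    bound[i] = d
                if bound[i] <= best_d:
                    break  # cutoff: candidate i cannot win this round
            if bound[i] > best_d:
                # Only reached when the bound is exact (all picks compared).
                best, best_d = i, bound[i]
        picks.append(best)
        del bound[best]
        del seen[best]
    return picks, comparisons

def random_fingerprints(n, n_bits=256, bits_on=24, seed=0):
    """Hypothetical test data: n random sparse fingerprints."""
    rng = random.Random(seed)
    return [set(rng.sample(range(n_bits), bits_on)) for _ in range(n)]

fps = random_fingerprints(300, seed=1)
naive_picks, naive_cmp = maxmin_naive(fps, 30)
pruned_picks, pruned_cmp = maxmin_pruned(fps, 30)
print(naive_picks == pruned_picks, naive_cmp, pruned_cmp)
```

Both functions return the same picks because a candidate is only skipped when its upper bound shows it cannot be the round's winner. How many comparisons are saved depends on the data set.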
This poster describes the latest advances in this technique and its application to compound collection enhancement. The key insight of our approach is that, in practice, diversity selection is a multi-objective optimization: without additional constraints, diversity picking tends to initially select “wacky” molecules, choosing obscure chemical functionality and enriching for errors in molecular representation (“broken” molecules). Hence we formulate diversity selection as an operation over three compound sets: selecting from the available compounds those that are most dissimilar to an existing in-house screening collection but maximally similar to a reference set of desirable compounds (say, known kinase inhibitors). The goal is therefore not to select the most diverse compounds in the entirety of chemical space, but to sample uniformly from within a constrained “drug-like” space. This approach can be visualized as a scatter plot with novelty (similarity to the current collection) on the X-axis and desirability (similarity to prototype ideal compounds) on the Y-axis. The fact that alpha-beta pruning can be applied to one axis but not the other leads to an interesting asymmetry, yet it enables the efficient identification of compounds on (or near) the Pareto frontier (i.e. those to be considered for purchasing) in a fraction of the computational effort previously required. Performance figures for efficient CPU and GPU implementations will be presented. The ability to vary both the reference set of desirable compounds and the number of compounds selected allows chemical space to be sampled with customizable non-uniformity, with more samples in areas of interest and fewer samples in more speculative regions of chemical space. This ability to trade off exploration against exploitation is related to the multi-armed bandit problem in game theory.
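The three-set formulation and the Pareto frontier can be sketched in a few lines. The code below is an illustrative assumption rather than the poster's implementation: fingerprints are modelled as Python sets, novelty is taken as one minus the highest Tanimoto similarity to the in-house collection, desirability as the highest similarity to the reference set, and the function names are hypothetical. It scores candidates by brute force and does not include the pruning described above.

```python
def tanimoto(a, b):
    """Tanimoto similarity for fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def score(candidate, in_house, reference):
    """Return (novelty, desirability) for one candidate (assumed definitions)."""
    novelty = 1.0 - max(tanimoto(candidate, c) for c in in_house)
    desirability = max(tanimoto(candidate, r) for r in reference)
    return novelty, desirability

def pareto_front(points):
    """Indices of points not dominated on (novelty, desirability).

    Points are visited from most to least novel; a point is kept only if its
    desirability exceeds that of every point seen so far. Exact duplicates
    are reported once.
    """
    order = sorted(range(len(points)),
                   key=lambda i: (-points[i][0], -points[i][1]))
    front = []
    best_y = float("-inf")
    for i in order:
        if points[i][1] > best_y:
            front.append(i)
            best_y = points[i][1]
    return front

# Toy data (hypothetical): one in-house compound, one reference compound.
in_house = [{1, 2, 3}]
reference = [{7, 8, 9}]
candidates = [{1, 2, 3}, {7, 8, 9}, {1, 2, 7}]
points = [score(c, in_house, reference) for c in candidates]
print(pareto_front(points))
```

Varying the reference set, or the number of compounds taken from the frontier, is one way to sample some regions more densely than others.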
[1] M. Ashton, J. Barnard, G. Downs, J. Holliday, P. Willett et al., “Identification of Diverse Database Subsets using Property-based and Fragment-based Molecular Descriptors”, Quantitative Structure-Activity Relationships, 21(6):598-604, 2002.
[2] R. Kennard and L. Stone, “Computer Aided Design of Experiments”, Technometrics, 11(1):137-148, 1969.