Exploring Virtual Libraries using ElectroShape: Efficient ML-guided Descriptor Generation
Richard I. Cooper, Aras Asaad and Paul W. Finn.
Oxford Drug Design
Ultrafast shape recognition (USR) [1] and ElectroShape [2] are well-established ligand-based virtual screening methods that measure the similarity of molecular conformations and atomic property distributions. Because these methods avoid computing a superposition of 3D molecular conformations, they allow very large databases of molecular conformers to be scanned rapidly for similarity. However, preparing a database of descriptors requires a one-time explicit enumeration of the molecules and their low-energy conformations, which makes these methods unsuitable for searching very large “synthesize on demand” chemical spaces. These chemical spaces can reach 10^26 molecules and beyond [3], and even simple enumeration and storage of the reaction products of such a space would require an infeasible amount of time and storage.
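As an illustration of why such descriptors can be compared without superposition, the following is a minimal sketch of a USR-style descriptor and similarity score for a single conformer. It is not the authors' implementation: the function names are hypothetical, the coordinates are made up, and the choice of moments (mean, standard deviation, signed cube root of the third central moment) is one common convention that is assumed here. ElectroShape extends the same idea with additional atomic-property dimensions, which this sketch omits.

```python
import math

def _moments(values):
    """Mean, standard deviation and signed cube root of the third central moment.
    Assumption: published implementations differ in how these moments are scaled."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return [mean, math.sqrt(var), math.copysign(abs(third) ** (1.0 / 3.0), third)]

def usr_descriptor(coords):
    """12-number shape descriptor from atomic coordinates (list of (x, y, z))."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # molecular centroid
    d_ctd = [math.dist(ctd, c) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]   # atom closest to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]   # atom farthest from the centroid
    d_fct = [math.dist(fct, c) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]   # atom farthest from fct
    d_cst = [math.dist(cst, c) for c in coords]
    d_ftf = [math.dist(ftf, c) for c in coords]
    descriptor = []
    for dists in (d_ctd, d_cst, d_fct, d_ftf):
        descriptor.extend(_moments(dists))
    return descriptor

def usr_similarity(a, b):
    """Inverse of (1 + mean absolute difference); 1.0 means identical descriptors."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))

# Made-up coordinates, for illustration only.
mol_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.1, 1.2, 0.0), (3.4, 1.0, 0.8), (0.2, -1.1, 0.5)]
mol_a_shifted = [(x + 1.0, y + 2.0, z + 3.0) for x, y, z in mol_a]  # same shape, translated
mol_b = [(0.0, 0.0, 0.0), (1.4, 0.1, 0.0), (2.8, 0.0, 0.2), (4.2, 0.3, 0.1), (5.5, 0.0, 0.0)]

sim_same = usr_similarity(usr_descriptor(mol_a), usr_descriptor(mol_a_shifted))
sim_diff = usr_similarity(usr_descriptor(mol_a), usr_descriptor(mol_b))
```

Because the descriptor is built only from interatomic distances, translating or rotating the conformer leaves it unchanged, so comparing two molecules reduces to comparing two short vectors.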
An AI-enabled approach to exploring chemical spaces using automated docking was recently reported [4] wherein a machine learning model predicts docking scores of unseen molecules after training on a small pool of docking results. The predictions are used to augment the pool of docking experiments and these results are used to retrain and improve the model, thereby exploring the chemical space in a way that is more efficient than random sampling and docking.
We have applied a similar protocol to ligand-based virtual screening, avoiding the precomputation step required for USR and analogous methods. ML models based on 2D descriptor information are used to estimate the similarity between the ElectroShape 5D descriptors of a pair of molecules, and this estimate helps to select the next pool of molecules for conformer generation. This approach enables efficient exploration of chemical space while minimising unnecessary computation and storage of descriptors. The selection step can be combined with additional constraints and filters specific to the problem being considered.
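The iterative protocol described above can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the method used in this work: the surrogate model is a similarity-weighted nearest-neighbour regressor on 2D fingerprints, the selection rule is greedy (highest predicted similarity first), and the "expensive" oracle is a stand-in function, where a real workflow would generate conformers and compare ElectroShape descriptors. All names (`explore`, `predict`, `toy_oracle`, `make_toy_library`) are hypothetical.

```python
import random

def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints stored as sets of bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def predict(fp, labelled, fps, k=3):
    """Surrogate model (assumed): similarity-weighted mean score of the k nearest
    already-evaluated molecules in 2D fingerprint space."""
    neighbours = sorted(
        ((tanimoto(fp, fps[m]), score) for m, score in labelled.items()),
        reverse=True,
    )[:k]
    weight = sum(w for w, _ in neighbours)
    if weight == 0:
        return sum(s for _, s in neighbours) / len(neighbours)
    return sum(w * s for w, s in neighbours) / weight

def explore(fps, oracle, seed_size=20, batch_size=10, rounds=5, rng=None):
    """Evaluate a random seed pool, then repeatedly evaluate the molecules
    that the surrogate model ranks highest among those not yet evaluated."""
    rng = rng or random.Random(0)
    ids = list(fps)
    labelled = {m: oracle(m) for m in rng.sample(ids, seed_size)}
    for _ in range(rounds):
        pool = [m for m in ids if m not in labelled]
        if not pool:
            break
        pool.sort(key=lambda m: predict(fps[m], labelled, fps), reverse=True)
        for m in pool[:batch_size]:
            labelled[m] = oracle(m)  # expensive step, run only on selected molecules
    return labelled

def make_toy_library(n=300, n_bits=64, bits_per_mol=12, seed=1):
    """Random fingerprints standing in for 2D descriptors of a virtual library."""
    rng = random.Random(seed)
    return {i: frozenset(rng.sample(range(n_bits), bits_per_mol)) for i in range(n)}

fps = make_toy_library()
query_fp = fps[0]
calls = []

def toy_oracle(mol_id):
    """Stand-in for conformer generation plus 3D descriptor comparison (assumption)."""
    calls.append(mol_id)
    return tanimoto(fps[mol_id], query_fp)

scored = explore(fps, toy_oracle)
best = max(scored, key=scored.get)
```

In this toy run only 70 of the 300 library members are passed to the oracle. Additional constraints or filters could be applied to `pool` before ranking.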
We demonstrate the efficiency of this method, and the potential for its application to large and unenumerated virtual libraries.
References
[2] Armstrong et al., J Comput Aided Mol Des 24(9), 789-80 (2010); Armstrong et al., J Comput Aided Mol Des 25(8), 785 (2011)
[3] BioSolveIT: Efficient 3D exploration of multi-billion compound spaces. BioSolveIT. https://cactus.nci.nih.gov/presentations/NIHBigDB_2020-12/ChristianLemmen4NIHworkshop.pdf. Accessed 02 Feb 2023
[4] Gentile et al., Nature Protocols 17, 672-697 (2022)