**Persistence Homological Statistical Summaries for Ligand-based Virtual Screening**

**Aras Asaad**, Richard Cooper and Paul Finn

*Oxford Drug Design, Oxford Centre for Innovation, Oxford.*


Ligand-based virtual screening is a widely used approach in the early stages of a drug discovery project to obtain lead molecules. It seeks to identify the compounds in a large database of chemical structures that are most similar to a known active query molecule. To be effective, the molecular representation used must capture features of the molecules that are related to their biological activities. Algebraic topology provides a promising framework for extracting persistent non-linear features via the theory of persistent homology. In this work, we propose persistence statistics (PersStat), a persistent homological representation that encodes topological as well as geometrical information from molecules. The output of persistent homology is called a persistence barcode (PersBar), which encapsulates finite topological invariants (e.g. connected components, loops and cavities) of the geometry generated from the 3D shape of molecules. PersBar is obtained using a process known as filtration, which uses a sequence of increasing distance resolutions to build a nested sequence of Rips complexes and so capture the birth and death of topological features. Persistence statistics are generated from PersBar by computing simple statistical summaries, such as the average, standard deviation, median, mode and interquartile range, of the birth, death, mid-point and range of the bars in the space of persistence barcodes. This vector representation of persistence statistics is then used as input to a machine learning classifier to generate a model differentiating active molecules from decoys; here we present results from light gradient boosting machines (LightGBM). We have tested the effectiveness of this approach on DUD-E and an internal dataset from our drug discovery projects.
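The summary step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `persistence_statistics`, the feature naming and the toy barcode are hypothetical, the barcode is assumed to be already computed and to contain only finite bars, and the mode is omitted because it is not well defined for continuous values without a binning choice.

```python
import numpy as np


def persistence_statistics(barcode):
    """Summarise a barcode given as an (n, 2) array of finite (birth, death) pairs.

    Hypothetical helper for illustration only; infinite bars are assumed
    to have been removed or truncated beforehand.
    """
    bars = np.asarray(barcode, dtype=float)
    if bars.ndim != 2 or bars.shape[1] != 2 or bars.shape[0] == 0:
        raise ValueError("barcode must be a non-empty sequence of (birth, death) pairs")

    birth, death = bars[:, 0], bars[:, 1]
    # The four per-bar quantities named in the abstract.
    quantities = {
        "birth": birth,
        "death": death,
        "midpoint": (birth + death) / 2.0,
        "range": death - birth,
    }

    features = {}
    for name, values in quantities.items():
        q1, q3 = np.percentile(values, [25, 75])
        features[f"{name}_mean"] = float(np.mean(values))
        features[f"{name}_std"] = float(np.std(values))  # population standard deviation
        features[f"{name}_median"] = float(np.median(values))
        features[f"{name}_iqr"] = float(q3 - q1)
    return features


# Toy barcode with three bars (values chosen only for illustration).
example_barcode = [(0.0, 1.0), (0.0, 2.0), (1.0, 4.0)]
stats = persistence_statistics(example_barcode)

# Fixed-length vector that could be passed to a classifier.
feature_vector = np.array(list(stats.values()))
```

Because every barcode maps to the same number of summaries regardless of how many bars it contains, molecules of different sizes yield vectors of equal length. In practice one such vector would presumably be computed per homology dimension and the results concatenated, although the abstract does not specify this.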