
Molecular hetero-encoder-derived descriptors and their use in QSAR and de novo generation

Panagiotis-Christos Kotsias1, Hongming Chen1, Esben Jannik Bjerrum1

1AstraZeneca
Chemical hetero-encoders are encoder-decoder based deep learning models whose architectures resemble those of autoencoders [1]. However, hetero-encoders are trained differently: they transcode the same molecular information from one format to another [2] or from one realization of a non-canonical SMILES string to another [3]. After training on large collections of unlabeled molecules (millions), the output of the compressed code layer between the encoder and decoder can be used as molecular descriptors or real-valued fingerprints. Both the descriptors and the latent space encoded by the code layer depend on the molecular format, the training dataset and the architectural choices made. In contrast to engineered or designed fingerprints and descriptors, they are data driven and learned directly from the molecules in the training dataset.
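
As an illustration, a minimal Keras-style sketch of such an encoder-decoder with a compressed code layer could look as follows (layer sizes, vocabulary size and code dimension are illustrative assumptions, not the exact published model; the teacher-forced decoder input anticipates the training regime described below):

# Minimal sketch of an LSTM hetero-encoder with a bottleneck "code" layer
# (illustrative only; sizes and names are assumptions, not the authors' exact model).
from tensorflow.keras.layers import Input, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model

MAX_LEN, N_CHARS, CODE_DIM = 100, 40, 256   # assumed SMILES length, vocabulary, code size

# Encoder: one-hot encoded SMILES in, compressed real-valued "code" vector out
enc_in = Input(shape=(MAX_LEN, N_CHARS), name="input_smiles")
_, h, c = LSTM(256, return_state=True)(enc_in)
code = Dense(CODE_DIM, activation="relu", name="code")(Concatenate()([h, c]))

# Decoder: teacher-forced on the target SMILES shifted by one character,
# with its initial LSTM states derived from the code layer
dec_in = Input(shape=(MAX_LEN, N_CHARS), name="target_smiles_shifted")
h0 = Dense(256, activation="relu")(code)
c0 = Dense(256, activation="relu")(code)
dec_seq = LSTM(256, return_sequences=True)(dec_in, initial_state=[h0, c0])
dec_out = Dense(N_CHARS, activation="softmax")(dec_seq)

hetero_encoder = Model([enc_in, dec_in], dec_out)
encoder = Model(enc_in, code)               # the part later reused to compute descriptors
hetero_encoder.compile(optimizer="adam", loss="categorical_crossentropy")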
We use simple LSTM-based encoders and decoders trained with teacher forcing of the decoder. However, instead of translating from canonical SMILES to canonical SMILES, the model is trained to map one non-canonical SMILES to a different non-canonical SMILES of the same molecule for each sample presented during training, using SMILES enumeration as a form of data augmentation [4]. This training regime turns the models into chemical hetero-encoders. Enumerated-SMILES-to-enumerated-SMILES training is a much harder task than memorizing a canonical SMILES string and reconstructing it, which made changes to the architecture necessary: the hetero-encoder must encode a latent representation of the molecule from any of its possible non-canonical SMILES and decode that representation into different SMILES of the same molecule. In turn, this makes the latent space more chemically relevant and breaks the dependency on the non-natural atom ordering of the canonical SMILES. After training, the encoder part can be used to convert molecules into real-valued vectors by intercepting the output activations of the code layer. This latent representation, or molecular code, contains all the information needed to reconstruct the molecular graph and is thus information complete. The code-layer-derived vectors are therefore information-rich, dense descriptors of molecules and excellent features for deriving QSAR models from much smaller labelled datasets (hundreds to thousands of samples). A comparison of QSAR models built with ECFP4 fingerprints, autoencoder-derived features and hetero-encoder-derived features shows that the hetero-encoder-derived features are superior.
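
A sketch of the SMILES enumeration step, assuming the atom-renumbering approach of [4] with RDKit (the descriptor/QSAR lines at the end use a hypothetical one_hot_encode helper and the encoder defined above):

import random
from rdkit import Chem

def random_smiles(smiles, rng=random):
    """Return one random, non-canonical SMILES spelling of the molecule."""
    mol = Chem.MolFromSmiles(smiles)
    order = list(range(mol.GetNumAtoms()))
    rng.shuffle(order)
    return Chem.MolToSmiles(Chem.RenumberAtoms(mol, order), canonical=False)

# Each training sample is a fresh pair of two different enumerations of the same
# molecule: one is one-hot encoded for the encoder, the other is the decoder target.
smiles = "c1ccccc1C(=O)O"                  # benzoic acid
encoder_input  = random_smiles(smiles)     # e.g. "C(=O)(O)c1ccccc1"
decoder_target = random_smiles(smiles)     # e.g. "OC(=O)c1ccccc1"

# After training, the frozen encoder converts labelled molecules to descriptors for QSAR
# (one_hot_encode is a hypothetical helper, not part of the published code):
# X = encoder.predict(one_hot_encode(train_smiles))    # shape (n_molecules, CODE_DIM)
# qsar = sklearn.ensemble.RandomForestRegressor().fit(X, train_activity)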
Apart from using the encoder to derive a latent representation of molecules, the decoder makes it possible to query which molecules correspond to a given point in the space defined by the latent descriptors, enabling direct inverse QSAR. This is an attractive feature, as the decoder can act as a steered de novo generator of molecules around points found to be optimal by the QSAR models. Here we report on our progress and experience with using hetero-encoders for de novo design and optimization of molecules for drug discovery purposes.
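
One simple way to use the decoder generatively is to sample latent points around a code vector that the QSAR model scores favourably and decode each point back to a SMILES string. The sketch below only shows the sampling step; decode_smiles and one_hot_encode in the commented usage are hypothetical helpers standing in for the trained decoder and the input featurization.

import numpy as np

def sample_latent_neighbourhood(code, n_samples=100, sigma=0.1, seed=42):
    """Gaussian perturbations around a promising code vector.
    Each perturbed vector would then be passed through the trained decoder
    to obtain candidate SMILES (decode step not shown here)."""
    rng = np.random.default_rng(seed)
    return code + rng.normal(0.0, sigma, size=(n_samples,) + code.shape)

# seed_code = encoder.predict(one_hot_encode([known_active_smiles]))[0]
# candidates = [decode_smiles(z) for z in sample_latent_neighbourhood(seed_code)]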

[1] Gómez-Bombarelli, Rafael, et al. “Automatic chemical design using a data-driven continuous representation of molecules.” ACS Central Science 4.2 (2018): 268-276.

[2] Winter, Robin, et al. “Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations.” Chemical Science (2019).

[3] Bjerrum, Esben Jannik, and Boris Sattarov. “Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders.” Biomolecules 8.4 (2018): 131.

[4] Bjerrum, Esben Jannik. “SMILES enumeration as data augmentation for neural network modeling of molecules.” arXiv preprint arXiv:1703.07076 (2017).