Solubility Prediction in Organic Solvents through a Combination of Cheminformatics and Computational Chemistry
Samuel Boobier1, David Hose2, John Blacker1, Bao Nguyen1
1University of Leeds, Leeds, LS2 9JT
2AstraZeneca, S41/2 Etherow Building, Charter Way, Macclesfield, Cheshire, SK10 2NA
Artificial Intelligence and Machine Learning continue to flourish, finding ever more diverse applications within the chemical sciences. They have been particularly successful in drug discovery and development, with applications including: suggesting potential hit molecules; assessing their physicochemical profiles; and generating potential synthetic routes once an API has been selected. In particular, the prediction of properties such as solubility is highly desirable. Not only is sufficient aqueous solubility generally required for an API to be active in the body, but understanding solubility is also crucial in the development stage. The role of solvents goes far beyond mediating reactions: they are also important in steps such as purification, separation and characterisation. In scaling up a process, avoiding waste and unnecessary steps is paramount, and solvent selection is a key consideration. Hence, it would be desirable to develop a tool that can quickly and accurately predict solubility across a range of solvents. Such a tool could be used by chemists in isolation, or integrated into a larger route prediction tool.
Numerous cheminformatic models to predict solubility have been developed previously. They almost exclusively predict water solubility in a drug discovery setting, which limits their use for reactions in non-aqueous solvents. Other limitations of these models include a lack of consensus on which descriptors and machine learning methods to use, and the risk of overfitting (learning from noise in the datasets and performing poorly on new data) if care is not taken in descriptor selection. The current work builds upon these foundations by using electronic structure methods to develop new descriptors, and machine learning to map these descriptors to solubility.
We herein present four new solubility datasets, for water, ethanol, benzene and acetone, mined from open-access academic databases. A host of 3D descriptors from DFT and semi-empirical PM6 calculations were combined with experimental melting points, a combination designed to capture both solution and solid-state interactions. Initial models in water equalled or exceeded the current literature standard, with our water dataset (T = 808, S = 95) producing R2 = 0.91 and RMSE = 0.79 log S using a Support Vector Machine (SVM) method. Cheminformatic models were built for benzene, ethanol and acetone for the first time, with promising results for the benzene dataset (T = 370, S = 94): R2 = 0.75 and RMSE = 0.55 log S, using an Extremely Randomised Trees algorithm. PM6 performed as well as DFT in almost all cases, with computation times 100 times shorter, which means that predictions can be obtained for multiple molecules within an hour.
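The two figures of merit quoted above, R2 and RMSE, can be computed from observed and predicted log S values as in the minimal sketch below. It assumes that R2 denotes the coefficient of determination (1 - SS_res / SS_tot); the abstract does not state which definition was used. The function names and the numerical values are illustrative only and are not taken from the datasets.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted log S values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values must not all be identical")
    return 1.0 - ss_res / ss_tot


# Made-up log S values for a four-compound test set (illustration only).
observed_log_s = [-1.0, -2.0, -3.0, -4.0]
predicted_log_s = [-1.5, -2.0, -2.5, -4.0]

print(f"RMSE = {rmse(observed_log_s, predicted_log_s):.3f} log S")
print(f"R2   = {r_squared(observed_log_s, predicted_log_s):.3f}")
```

Both metrics should be evaluated on compounds held out from training, since values computed on the training set say little about overfitting.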
We also present novel improvements to these models. New charged surface descriptors, based on DFT calculations at the 90 % isosurface level, were developed as a substitute for point charge descriptors. Neural network architectures were comprehensively explored; however, tree methods and SVM were found to be superior to any neural network architecture tested. Because predicting how to separate diastereomers by solubility is highly desirable, we also present a new dataset of 34 diastereomer pairs. When the solubility difference exceeded 1 log S, the model correctly predicted which diastereomer was more soluble in 100 % of cases.
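The diastereomer test described above amounts to a pairwise ordering check, which can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the function name `ordering_accuracy`, the tuple layout, and the example values are all hypothetical, and only the 1 log S threshold comes from the text.

```python
def ordering_accuracy(pairs, min_difference=1.0):
    """Count correctly ordered diastereomer pairs.

    Each pair is (exp_a, exp_b, pred_a, pred_b): experimental and predicted
    log S for diastereomers A and B. Only pairs whose experimental difference
    exceeds min_difference (in log S units) are considered.
    Returns (n_correct, n_considered).
    """
    n_correct = 0
    n_considered = 0
    for exp_a, exp_b, pred_a, pred_b in pairs:
        exp_diff = exp_a - exp_b
        if abs(exp_diff) <= min_difference:
            continue  # difference too small to count as a clear separation
        n_considered += 1
        pred_diff = pred_a - pred_b
        # Correct when the model ranks the pair in the same order as experiment.
        if exp_diff * pred_diff > 0:
            n_correct += 1
    return n_correct, n_considered


# Made-up values for three hypothetical pairs (illustration only).
example_pairs = [
    (-1.0, -2.5, -1.2, -2.0),  # clear difference, order predicted correctly
    (-3.0, -3.4, -3.5, -3.1),  # difference below threshold, skipped
    (-4.0, -2.0, -3.0, -3.5),  # clear difference, order predicted wrongly
]

correct, considered = ordering_accuracy(example_pairs)
print(f"{correct} of {considered} clearly separated pairs ordered correctly")
```

Restricting the check to pairs above a threshold reflects the fact that small experimental differences are comparable to the model's own error, so their ordering carries little signal.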
We have made significant progress in developing a tool that process chemists can use to quickly predict solubility across four solvents. We find that SVM and Random Forest type methods currently outperform neural networks, and we have provided insight into how these models can be improved further through the development of novel charged surface descriptors. We have demonstrated the tool's utility by gathering diastereomer data and correctly predicting the solubility order when the difference is sufficiently large.