The more the merrier: data fusion applied to commercially available fast pKa-prediction methods for small molecules
Tuomo Kalliokoski1, Kai Sinervo1
1Orion Pharma, Espoo, Finland
The negative logarithm of the acid dissociation constant Ka (pKa, often also referred to as the acidity constant) is an important parameter in lead identification and optimisation and is closely linked to lipophilicity. In addition to its effect on a compound’s biological activity and selectivity, it also has a significant influence on other vital properties such as solubility, permeability and ADMET (Absorption, Distribution, Metabolism, Excretion and Toxicity). Numerous experimental methods with sufficient throughput are available, but they can naturally be used only on existing molecules, whereas it would often be desirable to know the pKa of a screening compound before purchase or, even more so, before embarking on a laborious synthesis of a completely novel molecule. This is where computational pKa-prediction methods come into play: their main purpose is to estimate the pKa of molecules that are not available for experimental measurement for one reason or another.
Several pKa-prediction programs are commercially available, each trained on a different training set and based on a different algorithm. It has been suggested by Avdeef that the accuracy of software predictions could be improved by employing multiple prediction tools at once and combining their results using data fusion. We were interested in verifying this hypothesis, as we were unable to find any studies in the scientific literature that use multiple pKa-prediction methods with data fusion. Furthermore, we were interested in seeing how the commercially available programs perform on our internal compounds.
We investigated the data fusion performance using the models available to us (ADMET-Predictor/S+pKa from Simulations Plus, ACD Percepta/Classic&GALAS from Advanced Chemistry Development, Epik from Schrödinger) both on a public dataset published by Vertex researchers and on a dataset compiled from our internal database. The datasets were different in their chemical structures and properties. The accuracy of predictions was measured using Median Absolute Error (MAE) and squared correlation coefficient (r²).
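The two accuracy measures named above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and r² is assumed here to mean the squared Pearson correlation between experimental and predicted pKa values, which the abstract does not spell out.

```python
from statistics import median


def median_absolute_error(experimental, predicted):
    """Median of the absolute differences |experimental - predicted|."""
    return median(abs(e - p) for e, p in zip(experimental, predicted))


def squared_correlation(experimental, predicted):
    """Squared Pearson correlation coefficient (assumed meaning of r2)."""
    n = len(experimental)
    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    # Covariance and variance sums (the common 1/n factor cancels out).
    cov = sum((e - mean_e) * (p - mean_p)
              for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_e * var_p)
```

Note that MAE here is the median, not the mean, of the absolute errors, so a few badly mispredicted compounds affect it less than they would affect a mean-based error.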
Epik’s performance on the public dataset was not satisfactory, and thus it was not used on the internal dataset. Data fusion outperformed the single models on both datasets when comparing MAEs. The smallest MAE for a single model on the public dataset was 0.30, whereas data fusion reached an MAE of 0.27. The difference between single methods and data fusion was larger on the internal dataset: the smallest single-model MAE was 0.69 vs. a data fusion MAE of 0.50. In addition to better accuracy, data fusion enabled predictions for all of the compounds in both datasets (none of the single methods was able to produce a value for all of the compounds in this study).
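The coverage benefit described above can be illustrated with a small sketch. The abstract does not state which fusion rule was used, so the arithmetic mean of the available predictions is an assumption made here purely for illustration; the function name and the convention of marking a failed prediction with None are likewise hypothetical.

```python
from statistics import mean


def fuse_predictions(predictions):
    """Combine the pKa predictions of several programs for one compound.

    `predictions` holds one value per program; None marks a program that
    produced no value. The arithmetic mean is an assumed fusion rule,
    not necessarily the one used in the study.
    """
    available = [p for p in predictions if p is not None]
    if not available:
        return None  # no program could handle this compound
    return mean(available)
```

Under this rule a fused value exists whenever at least one program returns a prediction, which is how a combination can cover compounds that any single method misses.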
Our study confirms the hypothesis made by Avdeef that using several pKa-prediction methods simultaneously is beneficial. This finding is well in line with results from the field of ligand-based virtual screening, where data fusion has often been shown to produce better results than single searches. If one has several pKa-prediction methods available, it would be advisable to use them all at once instead of relying on just one particular piece of software.
A. Avdeef. Drug Ionization and Physicochemical Profiling. In Molecular Drug Properties: Measurement and Prediction (Ed: R. Mannhold), Wiley-VCH Verlag, Weinheim 2008, pp. 53-83.
L. Settimo, K. Bellman, R. M. A. Knegtel. Comparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharm. Res. 2014, 31, 1082-1095.