Effects of additional compound and protein information on the performance of Macau.
Elisa Hernández Muñoz1,2, Antonio de la Vega de León1
1Information School, University of Sheffield, S1 4DP Sheffield, UK
2Facultad de Ciencias, Universidad Autónoma de Madrid, Ciudad Universitaria de Cantoblanco, 28049, Madrid, Spain
Macau is a novel machine learning algorithm that uses Bayesian probabilistic matrix factorization to predict the activities of compounds in many assays simultaneously, which makes it a multitask method. Matrix factorization approximates the original matrix by the product of two smaller factor matrices, so that rows and columns are related through a shared latent space. The factor matrices are optimized to reduce the error on the known values of the original matrix and can then be used to predict its missing values. Thanks to its Bayesian formulation, Macau is less prone to overfitting and provides an estimate of the uncertainty of its predictions, which is useful when assessing the quality of the results. The only data required to obtain predictions is the activity data set, but Macau can also incorporate side information about the molecules and about the assays to establish relations among them and thereby achieve a better matrix factorization. One of its most important capabilities is that it scales to large matrices with millions of rows or columns while predicting activities for many assays or proteins at the same time (Simm et al., 2015, arXiv 1509.04610).
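The factorization idea described above can be illustrated with a minimal sketch. It is not Macau itself: it uses plain stochastic gradient descent with L2 regularization instead of Bayesian inference, ignores side information, and gives no uncertainty estimates. The toy activity matrix, the function names and all hyperparameters are assumptions made for illustration only.

```python
import random

# Toy activity matrix (rows: compounds, columns: assays); None marks a missing value.
# All numbers are invented for illustration.
toy = [
    [6.0, 5.0, None, 7.0],
    [6.2, None, 4.1, 7.1],
    [None, 4.9, 4.0, 6.8],
    [5.9, 5.1, 4.2, None],
]

def factorize(matrix, n_latent=2, n_epochs=500, lr=0.01, reg=0.02, seed=0):
    """Fit row factors U and column factors V to the known entries by SGD."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    U = [[rng.gauss(0, 0.1) for _ in range(n_latent)] for _ in range(n_rows)]
    V = [[rng.gauss(0, 0.1) for _ in range(n_latent)] for _ in range(n_cols)]
    known = [(i, j, matrix[i][j])
             for i in range(n_rows) for j in range(n_cols)
             if matrix[i][j] is not None]
    for _ in range(n_epochs):
        rng.shuffle(known)
        for i, j, y in known:
            # Error of the current prediction for one known entry
            err = y - sum(U[i][k] * V[j][k] for k in range(n_latent))
            for k in range(n_latent):
                u, v = U[i][k], V[j][k]
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
    return U, V

def predict(U, V, i, j):
    """Predicted value for row i and column j: dot product of latent vectors."""
    return sum(u * v for u, v in zip(U[i], V[j]))

def rmse_on_known(matrix, U, V):
    """Root mean squared error over the known (non-missing) entries."""
    errors = [(matrix[i][j] - predict(U, V, i, j)) ** 2
              for i in range(len(matrix)) for j in range(len(matrix[0]))
              if matrix[i][j] is not None]
    return (sum(errors) / len(errors)) ** 0.5

U, V = factorize(toy)
print("RMSE on known entries:", rmse_on_known(toy, U, V))
print("Prediction for the missing entry (0, 2):", predict(U, V, 0, 2))
```

Each compound and each assay gets a short latent vector, and a missing activity is predicted as the dot product of the two vectors involved. In Macau, side information additionally shapes these latent vectors.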
Because Macau’s behaviour with side information had not been studied previously, we analysed to what extent side information changes its performance relative to the results obtained with the activity data alone. We used the Published Kinase Inhibitor Set (PKIS), which contains the activities of 376 compounds in 454 assays that differ in the kinase tested or in the concentration used. First, Macau was tested without any side information, using only the activity matrix. Then side information was added, either about the compounds, in the form of different kinds of chemical fingerprints, or about the proteins in the assays, in the form of numerical vectors derived from the proteins’ sequences and amino acid content. Finally, compound and protein side information were tested together. The fingerprints tested included Morgan, atom pair, MACCS and torsional fingerprints. We also studied how the fingerprint radius influenced the results.
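As an illustration of a numerical vector related to a protein’s amino acid content, the sketch below computes a simple amino acid composition descriptor. This abstract does not specify which protein descriptors were used in the study, so this is only one plausible example of the kind. The function name and the example sequence are hypothetical.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids, in a fixed order
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return a 20-element vector with the fraction of each standard amino acid.

    Characters that are not standard amino acid codes are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]

# Invented toy sequence, not a real kinase
vector = amino_acid_composition("MKKAGLLVAG")
print(dict(zip(AMINO_ACIDS, vector)))
```

A vector like this has the same length for every protein, so it can be stacked into a side-information matrix with one row per assay protein.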
The results confirmed that Macau’s predictions improve when side information is added. However, the size of the improvement depended on the kind of fingerprint or protein descriptor used and was related to the amount of information it held. The best performance was obtained with Morgan fingerprints (equivalent to extended connectivity fingerprints, ECFP) of the largest radius combined with the protein descriptors that included the most information about the sequences. On the other hand, we observed that improving Macau’s results entailed an increase in running time. The combination of side information that gave the largest improvement required a markedly greater increase in running time than similar combinations with only slightly worse results. In conclusion, we achieved a better characterization of this novel technique, which could benefit future studies of Macau: it is a promising multitask machine learning technique, yet only a few studies about it have been published to date.