Common Mistakes in Building QSAR Models
Damjan Krstajic1, Ljubomir J Buturovic2, David E Leahy3, Simon Thomas4
1Research Centre for Cheminformatics, Serbia
2Pathwork Diagnostics, USA
4Cyprotex Discovery Ltd, UK
The origins of high-dimensional statistics lie in QSAR. QSAR scientists were among the first to work with datasets in which the number of variables is larger than the number of samples (n << p). It is thanks to QSAR research that we now have dimension reduction techniques such as PLS.
However, in the last decade or so, the bioinformatics community has begun to work in similar settings. Building good models on microarray data held out the prospect of creating new diagnostic tools and understanding diseases. Therefore, significant research effort has been invested in high-dimensional statistics, leading to new methodologies and criteria for building and assessing good predictive models. We would like to inform the community about the importance of using the latest findings in high-dimensional statistics when building QSAR models.
In our talk we will outline some common methodological mistakes, to which we, too, were prone (1). In particular, we would like to explain and emphasize:
A) The importance of performing feature selection within, not prior to, cross-validation (2, 3) and our solution for achieving this.
B) The importance of executing nested cross-validation for assessing prediction error (4).
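As a minimal sketch of points A and B, the Python example below uses pure-noise data, so any accuracy well above chance reflects selection bias rather than signal. Everything in it is an assumption made for illustration and is not the authors' solution: the helper names (`select_features`, `cv_accuracy`, `nested_cv_accuracy`), the mean-difference feature ranking, the nearest-centroid classifier, and the fold counts.

```python
import random
from statistics import fmean

def make_noise_data(n=40, p=500, seed=0):
    """Pure-noise data: the descriptors carry no information about the labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y

def select_features(X, y, rows, k):
    """Rank features by |difference of class means| using only `rows`; keep the top k."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    def score(j):
        return abs(fmean(X[i][j] for i in pos) - fmean(X[i][j] for i in neg))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def fit_predict(X, y, train, test, feats):
    """Nearest-centroid classifier restricted to the columns in `feats`."""
    cent = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        cent[c] = [fmean(X[i][j] for i in members) for j in feats]
    def dist(i, c):
        return sum((X[i][j] - m) ** 2 for j, m in zip(feats, cent[c]))
    return [0 if dist(i, 0) <= dist(i, 1) else 1 for i in test]

def folds(rows, n_folds):
    """Split `rows` into n_folds interleaved (train, test) pairs."""
    for f in range(n_folds):
        test = rows[f::n_folds]
        train = [i for i in rows if i not in test]
        yield train, test

def cv_accuracy(X, y, rows, k, n_folds, preselected=None):
    """CV accuracy; features are re-selected on each training fold
    unless a fixed `preselected` list is supplied."""
    hits = 0
    for train, test in folds(rows, n_folds):
        feats = preselected or select_features(X, y, train, k)
        preds = fit_predict(X, y, train, test, feats)
        hits += sum(p == y[i] for p, i in zip(preds, test))
    return hits / len(rows)

def nested_cv_accuracy(X, y, rows, ks, outer=5, inner=4):
    """Outer loop estimates accuracy; inner loop (training rows only) picks k."""
    hits = 0
    for train, test in folds(rows, outer):
        best_k = max(ks, key=lambda k: cv_accuracy(X, y, train, k, inner))
        feats = select_features(X, y, train, best_k)
        preds = fit_predict(X, y, train, test, feats)
        hits += sum(p == y[i] for p, i in zip(preds, test))
    return hits / len(rows)

X, y = make_noise_data()
rows = list(range(len(X)))

# Mistake A: features chosen once on all samples, then cross-validated.
all_feats = select_features(X, y, rows, 10)
acc_outside = cv_accuracy(X, y, rows, 10, 5, preselected=all_feats)

# Feature selection repeated within every training fold.
acc_inside = cv_accuracy(X, y, rows, 10, 5)

# Point B: the number of features k is tuned in an inner loop.
acc_nested = nested_cv_accuracy(X, y, rows, ks=[5, 10, 20])

print(f"selection before CV: {acc_outside:.2f}")
print(f"selection within CV: {acc_inside:.2f}")
print(f"nested CV:           {acc_nested:.2f}")
```

On noise data such as this, the estimate obtained with selection before cross-validation will typically be far above chance, while the two estimates that keep selection inside the training folds should stay close to 50%.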
Furthermore, we question the use of some error statistics that are regularly used in QSAR but, to our knowledge, are not applied in other fields of statistics, and we propose more relevant error statistics for QSAR modelling.
1. Cartmell J, Enoch S, Krstajic D, Leahy DE (2005): Automated QSPR through Competitive Workflow, Journal of Computer-Aided Molecular Design 19(11), 821-833
2. Hastie T, Tibshirani R, Friedman J (2009): The Elements of Statistical Learning (Data mining, Inference and Prediction), Springer
3. Ambroise C, McLachlan GJ (2002): Selection bias in gene extraction on the basis of microarray gene-expression data, PNAS 99(10)
4. Varma S, Simon R (2006): Bias in error estimation when using cross-validation for model selection, BMC Bioinformatics 7:91