JMLR 2003

Overfitting in Making Comparisons Between Variable Selection Methods

Abstract

This paper addresses a common methodological flaw in the comparison of variable selection methods. A practical approach to guide the search or the selection process is to compute cross-validation performance estimates of the different variable subsets. Used with computationally intensive search algorithms, these estimates may overfit and yield biased predictions. Therefore, they cannot be used reliably to compare two selection methods, as is shown by the empirical results of this paper. Instead, as in other instances of the model selection problem, independent test sets should be used for determining the final performance. The claims made in the literature about the superiority of more exhaustive search algorithms over simpler ones are also revisited, and some of them infirmed.
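The effect described in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's experimental setup: the pure-noise data, the nearest-centroid classifier, the greedy forward search, and all function names below are assumptions chosen only for illustration. Because the features carry no information about the labels, any subset's true accuracy is about 0.5; the cross-validation score that guided the search will typically look better than that, while the independent test set should stay near chance.

```python
import random

def nearest_centroid_predict(train_X, train_y, x, feats):
    # Classify x by the closer class centroid, using only the features in feats.
    dists = {}
    for label in (0, 1):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        d = 0.0
        for f in feats:
            mean = sum(r[f] for r in rows) / len(rows)
            d += (x[f] - mean) ** 2
        dists[label] = d
    return min(dists, key=dists.get)

def cv_accuracy(X, y, feats, k=5):
    # k-fold cross-validation accuracy; fold j holds out rows j, j+k, j+2k, ...
    n = len(X)
    correct = 0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        tr_X = [X[i] for i in range(n) if i not in held_out]
        tr_y = [y[i] for i in range(n) if i not in held_out]
        for i in held_out:
            correct += nearest_centroid_predict(tr_X, tr_y, X[i], feats) == y[i]
    return correct / n

def holdout_accuracy(train_X, train_y, test_X, test_y, feats):
    # Accuracy on data that played no part in the search.
    hits = sum(
        nearest_centroid_predict(train_X, train_y, x, feats) == t
        for x, t in zip(test_X, test_y)
    )
    return hits / len(test_X)

def forward_selection(X, y, n_features, max_size=5):
    # Greedy forward search guided by the CV estimate (the practice under scrutiny).
    selected, best = [], 0.0
    while len(selected) < max_size:
        candidates = [f for f in range(n_features) if f not in selected]
        scores = {f: cv_accuracy(X, y, selected + [f]) for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:
            break  # no candidate improves the CV estimate
        selected.append(f_best)
        best = scores[f_best]
    return selected, best

def make_noise_data(rng, n, n_features):
    # Gaussian noise features with balanced labels that are unrelated to them.
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y

rng = random.Random(0)
N_FEATURES = 30
train_X, train_y = make_noise_data(rng, 40, N_FEATURES)
test_X, test_y = make_noise_data(rng, 400, N_FEATURES)

selected, cv_score = forward_selection(train_X, train_y, N_FEATURES)
test_score = holdout_accuracy(train_X, train_y, test_X, test_y, selected)
print("selected features:", selected)
print("CV estimate used by the search:", cv_score)
print("independent test accuracy:", test_score)
```

Comparing two search algorithms by the first printed score would favour whichever explored more subsets, which is why the final comparison should rest on the held-out figure.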

📈 Trend Setter — Statistical Learning
🧭 Keyword Pioneer — feature subset
🐝 Cross-Pollinator — Artificial Intelligence, Computer Vision, Data Science & Analytics, Deep Learning, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning
🌱 Topic Pioneer — Evaluation
🐣 Hot Topic Early Bird — model selection

Authors