2018 JMLR JMLR 2018

To Tune or Not to Tune the Number of Trees in Random Forest

Abstract

The number of trees $T$ in the random forest (RF) algorithm for supervised learning has to be set by the user. It is unclear whether $T$ should simply be set to the largest computationally manageable value or whether a smaller $T$ may be sufficient or in some cases even better. While the principle underlying bagging is that more trees are better, in practice the classification error rate sometimes reaches a minimum before increasing again for increasing number of trees. The goal of this paper is four-fold: (i) providing theoretical results showing that the expected error rate may be a non-monotonous function of the number of trees and explaining under which circumstances this happens; (ii) providing theoretical results showing that such non-monotonous patterns cannot be observed for other performance measures such as the Brier score and the logarithmic loss (for classification) and the mean squared error (for regression); (iii) illustrating the extent of the problem through an application to a large number (n = 306) of datasets from the public database OpenML; (iv) finally arguing in favor of setting $T$ to a computationally feasible large number as long as classical error measures based on average loss are considered. [abs] [ pdf ][ bib ] © JMLR 2018. (edit, beta)

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Speech & Audio
🐣 Hot Topic Early Bird — decision tree