
Title: To Bag is to Prune

Abstract: It is notoriously difficult to build a bad Random Forest (RF). Concurrently, RF is perhaps the only standard ML algorithm that blatantly overfits in-sample without any consequence out-of-sample. Standard arguments cannot rationalize this paradox. I propose a new explanation: bootstrap aggregation and model perturbation as implemented by RF automatically prune a (latent) true underlying tree. More generally, randomized ensembles of greedily optimized learners implicitly perform optimal early stopping out-of-sample. So there is no need to tune the stopping point. By construction, variants of Boosting and MARS are also eligible for automatic tuning. I empirically demonstrate the property, with simulated and real data, by reporting that these new completely overfitting ensembles yield an out-of-sample performance equivalent to that of their tuned counterparts -- or better.
Comments: added references; corrected typos; added NN discussions and results
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
Cite as: arXiv:2008.07063 [stat.ML]
  (or arXiv:2008.07063v3 [stat.ML] for this version)
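
The pattern the abstract describes (a perfect in-sample fit with no out-of-sample penalty once trees are bagged) can be illustrated with a minimal pure-Python sketch. This is not the paper's code or experimental design. The simulated step-function data, the noise level, the sample sizes, the number of trees, and the helper names (`grow_tree`, `predict`, `simulate`, `forest_predict`) are all assumptions made for illustration. With a single feature, the sketch covers only bootstrap aggregation and omits the random feature selection that a full Random Forest adds.

```python
import random

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def grow_tree(xs, ys):
    """Greedily grow a 1-D regression tree with no stopping rule:
    splitting continues until each leaf holds a single distinct x value."""
    if len(set(xs)) == 1:
        return sum(ys) / len(ys)  # leaf: mean response
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    best, left_sum = None, 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        # Maximizing this score is equivalent to minimizing the split's SSE.
        score = left_sum ** 2 / i + (total - left_sum) ** 2 / (n - i)
        if best is None or score > best[0]:
            best = (score, i)
    i = best[1]
    threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return (threshold,
            grow_tree(*zip(*pairs[:i])),
            grow_tree(*zip(*pairs[i:])))

def predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def simulate(n, rng):
    # Assumed data-generating process: a one-split "true tree" plus noise.
    xs = [rng.random() for _ in range(n)]
    ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = simulate(200, rng)
x_test, y_test = simulate(500, rng)

# One fully grown tree fit on the original training sample.
single = grow_tree(x_train, y_train)

# Bagging: fully grown trees fit on bootstrap resamples, then averaged.
forest = []
for _ in range(100):
    idx = [rng.randrange(len(x_train)) for _ in x_train]
    forest.append(grow_tree([x_train[i] for i in idx],
                            [y_train[i] for i in idx]))

def forest_predict(x):
    return sum(predict(t, x) for t in forest) / len(forest)

single_train = mse(y_train, [predict(single, x) for x in x_train])
single_test = mse(y_test, [predict(single, x) for x in x_test])
forest_train = mse(y_train, [forest_predict(x) for x in x_train])
forest_test = mse(y_test, [forest_predict(x) for x in x_test])

print("single tree  train/test MSE:", single_train, single_test)
print("bagged trees train/test MSE:", forest_train, forest_test)
```

In this toy setting the single tree interpolates the training data, and the bagged ensemble also fits the training sample far more closely than the test sample, yet its test error is lower than the single tree's. No tree depth or stopping point is tuned anywhere in the sketch.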

Submission history

From: Philippe Goulet Coulombe
[v1] Mon, 17 Aug 2020 02:45:32 GMT (446kb,D)
[v2] Mon, 14 Sep 2020 04:10:02 GMT (460kb,D)
[v3] Fri, 5 Mar 2021 16:54:07 GMT (842kb,D)
[v4] Tue, 8 Jun 2021 21:54:35 GMT (1061kb,D)
