Statistics > Methodology
Title: Partitioned Cross-Validation for Divide-and-Conquer Density Estimation
(Submitted on 31 Aug 2016)
Abstract: We present an efficient method to estimate cross-validation bandwidth parameters for kernel density estimation in very large datasets, where ordinary cross-validation is rendered highly inefficient, both statistically and computationally. Our approach relies on calculating multiple cross-validation bandwidths on partitions of the data, followed by suitable scaling and averaging to return a partitioned cross-validation bandwidth for the entire dataset. The partitioned cross-validation approach produces substantial computational gains over ordinary cross-validation. We additionally show that partitioned cross-validation can be statistically efficient compared to ordinary cross-validation. We derive analytic expressions for the asymptotically optimal number of partitions and study its finite-sample accuracy through a detailed simulation study. We also propose a permuted version of partitioned cross-validation, which attains even higher efficiency. Theoretical properties of the estimators are studied, and the methodology is applied to the Higgs Boson dataset with 11 million observations.
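The partition, scale, and average recipe described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: a one-dimensional Gaussian kernel, least-squares cross-validation on a grid as the per-block selector, and a rescaling factor of m^(-1/5), which follows from the standard n^(-1/5) rate of the optimal bandwidth for second-order kernels. The function names (`lscv_score`, `cv_bandwidth`, `partitioned_cv_bandwidth`) are hypothetical, and the random shuffle only assigns points to blocks; it is not the permuted variant mentioned in the abstract.

```python
import numpy as np


def normal_pdf(x, s):
    """Density of N(0, s^2) evaluated at x."""
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))


def lscv_score(x, h):
    """Least-squares CV criterion for a Gaussian-kernel KDE with bandwidth h."""
    n = x.size
    d = x[:, None] - x[None, :]  # all pairwise differences
    # Integral of the squared estimate: Gaussian convolved with itself.
    term1 = normal_pdf(d, np.sqrt(2.0) * h).sum() / n**2
    # Leave-one-out term: sum over i != j, so remove the diagonal.
    off_diag = normal_pdf(d, h).sum() - n * normal_pdf(0.0, h)
    term2 = 2.0 * off_diag / (n * (n - 1))
    return term1 - term2


def cv_bandwidth(x, grid):
    """Grid value minimizing the LSCV criterion on the sample x."""
    scores = [lscv_score(x, h) for h in grid]
    return float(grid[int(np.argmin(scores))])


def partitioned_cv_bandwidth(x, m, grid_size=40, rng=None):
    """Average the per-block CV bandwidths, rescaled to the full sample size.

    Assumption: the optimal bandwidth scales like n^(-1/5), so a bandwidth
    tuned on a block of size n/m is multiplied by m^(-1/5).
    """
    rng = np.random.default_rng(rng)
    x = rng.permutation(np.asarray(x, dtype=float))  # random block assignment
    block_bandwidths = []
    for block in np.array_split(x, m):
        # Search grid centered on a normal-reference bandwidth (an assumption).
        ref = 1.06 * block.std(ddof=1) * block.size ** (-0.2)
        grid = ref * np.geomspace(0.25, 2.0, grid_size)
        block_bandwidths.append(cv_bandwidth(block, grid))
    return m ** (-0.2) * float(np.mean(block_bandwidths))
```

Each block costs on the order of (n/m)^2 kernel evaluations per grid point, so the total work is roughly n^2/m instead of n^2 for ordinary cross-validation on the full sample, which is the source of the computational gain under this sketch's assumptions.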
Submission history
From: Anirban Bhattacharya
[v1] Wed, 31 Aug 2016 23:01:21 GMT (581kb,D)