Statistics Theory
New submissions
New submissions for Fri, 29 Mar 24
- [1] arXiv:2403.18961 [pdf, other]
Title: Spatial confounding under infill asymptotics
Subjects: Statistics Theory (math.ST)
The estimation of regression parameters in spatially referenced data plays a crucial role across various scientific domains. A common approach involves employing an additive regression model to capture the relationship between observations and covariates, accounting for spatial variability not explained by the covariates through a Gaussian random field. While theoretical analyses of such models have predominantly focused on prediction and covariance parameter inference, recent attention has shifted towards understanding the theoretical properties of regression coefficient estimates, particularly in the context of spatial confounding. This article studies the effect of misspecified covariates, in particular when the misspecification changes the smoothness. We analyze the theoretical properties of the generalized least-squares estimator under infill asymptotics, and show that the estimator can have counter-intuitive properties. In particular, the estimated regression coefficients can converge to zero as the number of observations increases, despite high correlations between observations and covariates. Perhaps even more surprising, the estimates can diverge to infinity under certain conditions. Through an application to temperature and precipitation data, we show that both behaviors can be observed for real data. Finally, we propose a simple fix to the problem by adding a smoothing step in the regression.
- [2] arXiv:2403.19196 [pdf, other]
Title: What Is a Good Imputation Under MAR Missingness?
Subjects: Statistics Theory (math.ST)
Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. The present paper attempts to take a step back and provide a more systematic analysis: Starting from an in-depth discussion of the Missing at Random (MAR) condition for nonparametric imputation, we first develop an identification result, showing that the widely used Multiple Imputation by Chained Equations (MICE) approach indeed identifies the right conditional distributions. This result, together with two illuminating examples, allows us to propose four essential properties a successful MICE imputation method should meet, thus enabling a more principled evaluation of existing methods and more targeted development of new methods. In particular, we introduce a new method that meets 3 out of the 4 criteria. We then discuss and refine ways to rank imputation methods, even in the challenging setting when the true underlying values are not available. The result is a powerful, easy-to-use scoring algorithm to rank missing value imputations under MAR missingness.
- [3] arXiv:2403.19237 [pdf, other]
Title: Extreme change-point detection
Authors: Kevin Bleakley (CELESTE, LMO)
Subjects: Statistics Theory (math.ST)
We examine rules for predicting whether a point in $\mathbb{R}$ generated from a 50-50 mixture of two different probability distributions came from one distribution or the other, given limited (or no) information on the two distributions, and, as clues, one point generated randomly from each of the two distributions. We prove that nearest-neighbor prediction does better than chance when we know the two distributions are Gaussian densities without knowing their parameter values. We conjecture that this result holds for general probability distributions and, furthermore, that the nearest-neighbor rule is optimal in this setting, i.e., no other rule can do better than it if we do not know the distributions or do not know their parameters, or both.
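The nearest-neighbor rule described in this abstract can be explored numerically. The following is a minimal Monte Carlo sketch, not the author's code: it assumes two Gaussian distributions with parameters chosen by the caller, and the function name `nearest_neighbor_accuracy` and its arguments are hypothetical.

```python
import random


def nearest_neighbor_accuracy(mu0, sigma0, mu1, sigma1, trials=20_000, seed=0):
    """Estimate how often the nearest-neighbor rule picks the right source.

    Each trial draws one "clue" point from each Gaussian, then a test point
    from the 50-50 mixture, and predicts the distribution whose clue point
    lies closer to the test point.
    """
    rng = random.Random(seed)
    params = [(mu0, sigma0), (mu1, sigma1)]
    correct = 0
    for _ in range(trials):
        clue0 = rng.gauss(mu0, sigma0)   # clue from distribution 0
        clue1 = rng.gauss(mu1, sigma1)   # clue from distribution 1
        label = rng.randrange(2)         # 50-50 mixture component
        x = rng.gauss(*params[label])    # test point
        guess = 0 if abs(x - clue0) < abs(x - clue1) else 1
        correct += guess == label
    return correct / trials


if __name__ == "__main__":
    # Illustrative parameter choices only (assumptions, not from the paper).
    print(nearest_neighbor_accuracy(0.0, 1.0, 3.0, 1.0))
```

The rule uses no knowledge of the means or variances, which matches the limited-information setting of the abstract; the simulation only illustrates the claim and does not replace the proof.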
- [4] arXiv:2403.19395 [pdf, ps, other]
Title: Kernel entropy estimation for linear processes II
Subjects: Statistics Theory (math.ST)
Let $X=\{X_n: n\in \mathbb{N}\}$ be a linear process with bounded probability density function $f(x)$. Under certain conditions, we use the kernel estimator \[ \frac{2}{n(n-1)h_n} \sum_{1\le i<j\le n}K\Big(\frac{X_i-X_j}{h_n}\Big) \] to estimate the quadratic functional $\int_{\mathbb{R}}f^2(x)\,dx$ of the linear process $X$ and improve the corresponding results in [4].
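The kernel estimator in this abstract is a pairwise (U-statistic) average and can be written out directly. The sketch below is illustrative only: the Gaussian kernel, the bandwidth value, and the MA(1) example process are assumptions made for the demonstration, and the names `gaussian_kernel`, `quadratic_functional_estimate`, and `simulate_ma1` are hypothetical.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def quadratic_functional_estimate(xs, h, kernel=gaussian_kernel):
    """Compute 2 / (n (n-1) h) * sum_{i<j} K((X_i - X_j) / h)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += kernel((xs[i] - xs[j]) / h)
    return 2.0 * total / (n * (n - 1) * h)


def simulate_ma1(n, theta, rng):
    """Hypothetical example process: X_t = e_t + theta * e_{t-1}, Gaussian e_t."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [eps[t + 1] + theta * eps[t] for t in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    xs = simulate_ma1(500, 0.5, rng)
    estimate = quadratic_functional_estimate(xs, h=0.3)
    # For this Gaussian MA(1) the marginal is N(0, 1 + theta^2), so the
    # integral of f^2 equals 1 / (2 * sqrt(pi * (1 + theta^2))).
    target = 1.0 / (2.0 * math.sqrt(math.pi * 1.25))
    print(f"estimate={estimate:.4f} target={target:.4f}")
```

The double loop costs O(n^2) kernel evaluations, which is fine for a few hundred observations. The conditions on $K$ and $h_n$ under which the paper's results hold are not stated in the abstract, so the choices here should not be read as satisfying them.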
- [5] arXiv:2403.19396 [pdf, other]
Title: Persistent Diagram Estimation of Multivariate Piecewise Hölder-continuous Signals
Authors: Hugo Henneuse
Comments: 33 pages
Subjects: Statistics Theory (math.ST); Algebraic Topology (math.AT)
To our knowledge, the analysis of convergence rates for persistent diagram estimation from noisy signals has remained limited to lifting signal estimation results through sup-norm (or other functional norm) stability theorems. We believe that moving beyond this approach can lead to considerable gains. We illustrate this in the setting of the Gaussian white noise model. We examine, from a minimax perspective, the inference of persistent diagrams (for sublevel-set filtrations). We show that for piecewise Hölder-continuous functions, with control over the reach of the discontinuity set, taking the persistent diagram of a simple histogram estimator of the signal achieves the minimax rates known for Hölder-continuous functions.
Cross-lists for Fri, 29 Mar 24
- [6] arXiv:2311.18438 (cross-list from math.OC) [pdf, other]
Title: Solution-Set Geometry and Regularization Path of a Nonconvexly Regularized Convex Sparse Model
Comments: 53 pages, 10 figures. Submitted to journal
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Signal Processing (eess.SP); Statistics Theory (math.ST)
The generalized minimax concave (GMC) penalty is a nonconvex sparse regularizer which can preserve the overall convexity of the regularized least-squares problem. In this paper, we focus on a significant instance of the GMC model termed scaled GMC (sGMC), and present various notable findings on its solution-set geometry and regularization path. Our investigation indicates that while the sGMC penalty is a nonconvex extension of the LASSO penalty (i.e., the $\ell_1$-norm), the sGMC model preserves many celebrated properties of the LASSO model, and hence can serve as a less biased surrogate of LASSO without losing its advantages. Specifically, for a fixed regularization parameter $\lambda$, we show that the solution-set geometry, solution uniqueness and sparseness of the sGMC model can be characterized in a similarly elegant way to the LASSO model (see, e.g., Osborne et al. 2000, R. J. Tibshirani 2013). For a varying $\lambda$, we prove that the sGMC solution set is a continuous polytope-valued mapping of $\lambda$. Most notably, our study indicates that, similar to LASSO, the minimum $\ell_2$-norm regularization path of the sGMC model is continuous and piecewise linear in $\lambda$. Based on these theoretical results, an efficient regularization path algorithm is proposed for the sGMC model, extending the well-known least angle regression (LARS) algorithm for LASSO. We prove the correctness and finite termination of the proposed algorithm under a mild assumption, and confirm its correctness in general situations, efficiency, and practical utility through numerical experiments. Many results in this study also contribute to the theoretical research of LASSO.
- [7] arXiv:2403.19082 (cross-list from cs.LG) [pdf, other]
Title: Enhancing Conformal Prediction Using E-Test Statistics
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
Conformal Prediction (CP) serves as a robust framework that quantifies uncertainty in predictions made by Machine Learning (ML) models. Unlike traditional point predictors, CP generates statistically valid prediction regions, also known as prediction intervals, based on the assumption of data exchangeability. Typically, the construction of conformal predictions hinges on p-values. This paper, however, ventures down an alternative path, harnessing the power of e-test statistics to augment the efficacy of conformal predictions by introducing a BB-predictor (bounded-from-below predictor).
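For context on the p-value construction that the abstract contrasts with, here is a minimal sketch of the standard conformal p-value and the prediction set it induces. It does not implement the paper's e-test statistics or its BB-predictor; the nonconformity scores are supplied by the caller, and the names `conformal_p_value` and `prediction_set` are hypothetical.

```python
def conformal_p_value(calibration_scores, test_score):
    """Fraction of scores (calibration plus the test point itself) that are
    at least as nonconforming as the test score."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (n + 1)


def prediction_set(calibration_scores, candidate_scores, alpha):
    """Keep every candidate label whose conformal p-value exceeds alpha.

    candidate_scores maps each candidate label to the nonconformity score
    the test input would have under that label.
    """
    return [
        label
        for label, score in candidate_scores.items()
        if conformal_p_value(calibration_scores, score) > alpha
    ]


if __name__ == "__main__":
    calibration = [1.0, 2.0, 3.0, 4.0]  # made-up scores for illustration
    print(conformal_p_value(calibration, 2.5))
    print(prediction_set(calibration, {"a": 0.5, "b": 10.0}, alpha=0.2))
```

Under exchangeability of the calibration and test scores, a set built this way misses the true label with probability at most `alpha`.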
- [8] arXiv:2403.19157 (cross-list from math.PR) [pdf, other]
Title: Correlation functions between singular values and eigenvalues
Subjects: Probability (math.PR); Mathematical Physics (math-ph); Statistics Theory (math.ST)
Exploiting the explicit bijection between the density of singular values and the density of eigenvalues for bi-unitarily invariant complex random matrix ensembles of finite matrix size, we aim at finding the induced probability measure on $j$ eigenvalues and $k$ singular values, which we call the $j,k$-point correlation measure. We fully derive all $j,k$-point correlation measures in the simplest cases for one- and two-dimensional matrices. For $n>2$, we find a general formula for the $1,1$-point correlation measure. This formula simplifies drastically when assuming the singular values are drawn from a polynomial ensemble, yielding an explicit formula in terms of the kernel corresponding to the singular value statistics. These expressions simplify even further when the singular values are drawn from a Pólya ensemble and extend known results between their eigenvalue and singular value statistics.
- [9] arXiv:2403.19300 (cross-list from math.PR) [pdf, other]
Title: Random Multi-Type Spanning Forests for Synchronization on Sparse Graphs
Subjects: Probability (math.PR); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST)
Random diffusions are a popular tool in Monte Carlo estimation, with well-established algorithms such as Walk-on-Spheres (WoS) going back several decades. In this work, we introduce diffusion estimators for the problems of angular synchronization and smoothing on graphs, in the presence of a rotation associated with each edge. Unlike classical WoS algorithms, these estimators allow for global estimation by propagating along the branches of multi-type spanning forests, and we show that they can outperform standard numerical-linear-algebra solvers in challenging instances, depending on the topology and density of the graph.
- [10] arXiv:2403.19516 (cross-list from stat.ML) [pdf, ps, other]
Title: Maximum Likelihood Estimation on Stochastic Blockmodels for Directed Graph Clustering
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Statistics Theory (math.ST)
This paper studies the directed graph clustering problem through the lens of statistics, where we formulate clustering as estimating underlying communities in the directed stochastic block model (DSBM). We conduct the maximum likelihood estimation (MLE) on the DSBM and thereby ascertain the most probable community assignment given the observed graph structure. In addition to the statistical point of view, we further establish the equivalence between this MLE formulation and a novel flow optimization heuristic, which jointly considers two important directed graph statistics: edge density and edge orientation. Building on this new formulation of directed clustering, we introduce two efficient and interpretable directed clustering algorithms, a spectral clustering algorithm and a semidefinite programming based clustering algorithm. We provide a theoretical upper bound on the number of misclustered vertices of the spectral clustering algorithm using tools from matrix perturbation theory. We compare, both quantitatively and qualitatively, our proposed algorithms with existing directed clustering methods on both synthetic and real-world data, thus providing further ground to our theoretical contributions.
Replacements for Fri, 29 Mar 24
- [11] arXiv:2401.12331 (replaced) [pdf, other]
Title: Transfer Learning for Functional Mean Estimation: Phase Transition and Adaptive Algorithms
Subjects: Statistics Theory (math.ST)
- [12] arXiv:2306.15865 (replaced) [pdf, other]
Title: Differentially Private Distributed Estimation and Learning
Comments: Accepted for publication at IISE Transactions (Special issue on Federated, Distributed Learning and Analytics)
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Systems and Control (eess.SY); Statistics Theory (math.ST); Applications (stat.AP); Machine Learning (stat.ML)