Statistics Theory

New submissions

[ total of 16 entries: 1-16 ]

New submissions for Fri, 30 Oct 20

[1]  arXiv:2010.15351 [pdf, other]
Title: Nonparametric estimation of copulas and copula densities by orthogonal projections
Comments: 42 pages, 6 figures, 9 tables
Subjects: Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)

In this paper we study nonparametric estimators of copulas and copula densities. We first focus on a copula density estimator based on a polynomial orthogonal projection of the joint density. A new copula estimator is then deduced. Its asymptotic properties are studied: we provide a large functional class for which this construction is optimal in the minimax and maxiset sense, and we propose a selection method for the smoothing parameter. An intensive simulation study shows the very good performance of both the copula and copula density estimators, which we compare to a large panel of competitors. A real dataset in actuarial science illustrates this approach.

[2]  arXiv:2010.15515 [pdf, ps, other]
Title: Staged trees are curved exponential families
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)

Staged tree models are a discrete generalization of Bayesian networks. We show that these form curved exponential families and derive their natural parameters, sufficient statistic, and cumulant-generating function as functions of their graphical representation. We give necessary graphical criteria for classifying regular subfamilies and discuss implications for model selection.

[3]  arXiv:2010.15658 [pdf, other]
Title: Generalization bounds for deep thresholding networks
Comments: 19 pages, 4 figures
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)

We consider compressive sensing in the scenario where the sparsity basis (dictionary) is not known in advance, but needs to be learned from examples. Motivated by the well-known iterative soft thresholding algorithm for the reconstruction, we define deep networks parametrized by the dictionary, which we call deep thresholding networks. Based on training samples, we aim at learning the optimal sparsifying dictionary and thereby the optimal network that reconstructs signals from their low-dimensional linear measurements. The dictionary learning is performed via minimizing the empirical risk. We derive generalization bounds by analyzing the Rademacher complexity of hypothesis classes consisting of such deep networks. We obtain estimates of the sample complexity that depend only linearly on the dimensions and on the depth.
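The iterative soft thresholding algorithm that motivates these networks can be sketched as follows. This is a minimal illustration of classical ISTA for the objective 0.5*||A z - y||^2 + lam*||z||_1 with a known matrix A, not the paper's learned-dictionary network; the names `soft_threshold`, `ista`, `lam` and `n_iter` are assumptions made for this example. Each loop iteration plays the role that one layer would play in an unrolled network.

```python
import numpy as np

def soft_threshold(x, tau):
    """Entrywise soft thresholding: shrink each entry toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, y, lam, n_iter=200):
    """Classical ISTA for min_z 0.5*||A z - y||^2 + lam*||z||_1.

    Assumption: A is a known measurement matrix (no dictionary learning).
    """
    # Step size 1/L, where L is the squared spectral norm of A
    # (the Lipschitz constant of the gradient of the quadratic term).
    L = np.linalg.norm(A, 2) ** 2
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ z - y)                     # gradient of the quadratic term
        z = soft_threshold(z - grad / L, lam / L)    # proximal step for the l1 term
    return z
```

When A is the identity, the iteration reaches its fixed point after one step and returns the soft-thresholded measurements.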

[4]  arXiv:2010.15659 [pdf, ps, other]
Title: Post-selection inference with HSIC-Lasso
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

Detecting influential features in complex (non-linear and/or high-dimensional) datasets is key for extracting the relevant information. Most of the popular selection procedures, however, require assumptions on the underlying data, such as distributional ones, which barely agree with empirical observations. Therefore, feature selection based on nonlinear methods, such as the model-free HSIC-Lasso, is a more relevant approach. In order to ensure valid inference among the chosen features, the selection procedure must be accounted for. In this paper, we propose selective inference with HSIC-Lasso using the framework of truncated Gaussians together with the polyhedral lemma. Based on these theoretical foundations, we develop an algorithm allowing for low computational costs and the treatment of the hyper-parameter selection issue. The relevance of our method is illustrated using artificial and real-world datasets. In particular, our empirical findings emphasise that type-I error control at the considered level can be achieved.

Cross-lists for Fri, 30 Oct 20

[5]  arXiv:2010.15530 (cross-list from eess.SY) [pdf, other]
Title: Probabilistic interval predictor based on dissimilarity functions
Comments: 8 pages, 4 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Systems and Control (eess.SY); Statistics Theory (math.ST)

This work presents a new method to obtain probabilistic interval predictions of a dynamical system. The method uses stored past system measurements to estimate the future evolution of the system. The proposed method relies on the use of dissimilarity functions to estimate the conditional probability density function of the outputs. A family of empirical probability density functions, parameterized by means of two parameters, is introduced. It is shown that the proposed family encompasses the multivariable normal probability density function as a particular case. We show that the proposed method constitutes a generalization of classical estimation methods. A cross-validation scheme is used to tune the two parameters on which the methodology relies. In order to prove the effectiveness of the methodology presented, some numerical examples and comparisons are provided.

[6]  arXiv:2010.15539 (cross-list from math.PR) [pdf, other]
Title: Rates of convergence for Gibbs sampling in the analysis of almost exchangeable data
Subjects: Probability (math.PR); Statistics Theory (math.ST)

Motivated by de Finetti's representation theorem for partially exchangeable arrays, we want to sample $\mathbf p \in [0,1]^d$ from a distribution with density proportional to $\exp(-A^2\sum_{i<j}c_{ij}(p_i-p_j)^2)$. We are particularly interested in the case of an almost exchangeable array ($A$ large).
We analyze the rate of convergence of a coordinate Gibbs sampler used to simulate from these measures. We show that for every fixed matrix $C=(c_{ij})$, and large enough $A$, mixing happens in $\Theta(A^2)$ steps in a suitable Wasserstein distance. The upper and lower bounds are explicit and depend on the matrix $C$ through a few relevant spectral parameters.
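A coordinate Gibbs sweep for the density above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it assumes a systematic scan over coordinates, a symmetric matrix $C$ with nonnegative entries, and plain rejection sampling for the truncated normal conditionals. The names `gibbs_sweep`, `C`, `A` and `rng` are chosen for this example.

```python
import numpy as np

def gibbs_sweep(p, C, A, rng):
    """One systematic-scan Gibbs sweep for the density proportional to
    exp(-A^2 * sum_{i<j} c_ij (p_i - p_j)^2) on [0, 1]^d.

    Assumptions (not from the source): C is symmetric with nonnegative
    entries, and the scan visits coordinates in order 0, ..., d-1.
    """
    p = np.array(p, dtype=float)          # work on a copy
    for i in range(len(p)):
        w = np.array(C[i], dtype=float)
        w[i] = 0.0                        # the diagonal plays no role
        s = w.sum()
        if s == 0.0:
            p[i] = rng.uniform()          # conditional is uniform on [0, 1]
            continue
        # Completing the square in p_i gives a normal with
        # mean sum_j c_ij p_j / s and variance 1 / (2 A^2 s),
        # truncated to [0, 1].
        mean = (w @ p) / s
        std = 1.0 / (A * np.sqrt(2.0 * s))
        while True:                       # rejection step for the truncation
            draw = rng.normal(mean, std)
            if 0.0 <= draw <= 1.0:
                p[i] = draw
                break
    return p
```

Because the conditional mean is a weighted average of coordinates in $[0,1]$, it lies in $[0,1]$ itself, so the rejection step accepts quickly when $A$ is large.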

[7]  arXiv:2010.15690 (cross-list from cs.LG) [pdf, other]
Title: Analyzing the tree-layer structure of Deep Forests
Authors: Ludovic Arnould (LPSM UMR 8001), Claire Boyer (LPSM UMR 8001), Erwan Scornet (CMAP)
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)

Random forests on the one hand, and neural networks on the other hand, have met great success in the machine learning community for their predictive performance. Combinations of both have been proposed in the literature, notably leading to the so-called deep forests (DF) [25]. In this paper, we investigate the mechanisms at work in DF and outline that DF architecture can generally be simplified into simpler and computationally efficient shallow forest networks. Despite some instability, the latter may outperform standard predictive tree-based methods. In order to precisely quantify the improvement achieved by these light network configurations over standard tree learners, we theoretically study the performance of a shallow tree network made of two layers, each one composed of a single centered tree. We provide tight theoretical lower and upper bounds on its excess risk. These theoretical results show the interest of tree-network architectures for well-structured data provided that the first layer, acting as a data encoder, is rich enough.

[8]  arXiv:2010.15764 (cross-list from stat.ML) [pdf, other]
Title: Domain adaptation under structural causal models
Comments: 75 pages, 19 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Domain adaptation (DA) arises as an important problem in statistical machine learning when the source data used to train a model is different from the target data used to test the model. Recent advances in DA have mainly been application-driven and have largely relied on the idea of a common subspace for source and target data. To understand the empirical successes and failures of DA methods, we propose a theoretical framework via structural causal models that enables analysis and comparison of the prediction performance of DA methods. This framework also allows us to itemize the assumptions needed for the DA methods to have a low target error. Additionally, with insights from our theory, we propose a new DA method called CIRM that outperforms existing DA methods when both the covariates and label distributions are perturbed in the target data. We complement the theoretical analysis with extensive simulations to show the necessity of the devised assumptions. Reproducible synthetic and real data experiments are also provided to illustrate the strengths and weaknesses of DA methods when parts of the assumptions of our theory are violated.

[9]  arXiv:2010.15817 (cross-list from stat.ME) [pdf, other]
Title: Group-regularized ridge regression via empirical Bayes noise level cross-validation
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)

Features in predictive models are not exchangeable, yet common supervised models treat them as such. Here we study ridge regression when the analyst can partition the features into $K$ groups based on external side-information. For example, in high-throughput biology, features may represent gene expression, protein abundance or clinical data and so each feature group represents a distinct modality. The analyst's goal is to choose optimal regularization parameters $\lambda = (\lambda_1, \dotsc, \lambda_K)$ -- one for each group. In this work, we study the impact of $\lambda$ on the predictive risk of group-regularized ridge regression by deriving limiting risk formulae under a high-dimensional random effects model with $p\asymp n$ as $n \to \infty$. Furthermore, we propose a data-driven method for choosing $\lambda$ that attains the optimal asymptotic risk: the key idea is to interpret the residual noise variance $\sigma^2$ as a regularization parameter to be chosen through cross-validation. An empirical Bayes construction maps the one-dimensional parameter $\sigma$ to the $K$-dimensional vector of regularization parameters, i.e., $\sigma \mapsto \widehat{\lambda}(\sigma)$. Beyond its theoretical optimality, the proposed method is practical and runs as fast as cross-validated ridge regression without feature groups ($K=1$).
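For a fixed $\lambda$, group-regularized ridge regression can be sketched as below. This is a minimal illustration that assumes the unnormalized objective $\|y - X\beta\|^2 + \sum_j \lambda_{g(j)} \beta_j^2$, which may differ from the paper's scaling, and it does not implement the empirical Bayes map $\sigma \mapsto \widehat{\lambda}(\sigma)$. The function name `group_ridge` and the argument `groups` (the group index of each feature) are assumptions made for this example.

```python
import numpy as np

def group_ridge(X, y, groups, lambdas):
    """Ridge regression with one penalty per feature group.

    Minimizes ||y - X b||^2 + sum_j lambdas[groups[j]] * b_j^2.
    Assumption: this unnormalized objective is chosen for illustration.
    """
    groups = np.asarray(groups)
    # Expand the K group penalties into one penalty per feature.
    penalty = np.asarray(lambdas, dtype=float)[groups]
    gram = X.T @ X + np.diag(penalty)
    # Solve the penalized normal equations rather than inverting the matrix.
    return np.linalg.solve(gram, X.T @ y)
```

With a single group ($K=1$) this reduces to ordinary ridge regression with one penalty shared by all features.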

Replacements for Fri, 30 Oct 20

[10]  arXiv:1906.07514 (replaced) [pdf, other]
Title: Bayes Extended Estimators for Curved Exponential Families
Subjects: Statistics Theory (math.ST)
[11]  arXiv:1910.04267 (replaced) [pdf, ps, other]
Title: Subspace Estimation from Unbalanced and Incomplete Data Matrices: $\ell_{2,\infty}$ Statistical Guarantees
Comments: Accepted to Annals of Statistics
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
[12]  arXiv:2001.09602 (replaced) [pdf, ps, other]
Title: Bayesian Shrinkage Estimation of Negative Multinomial Parameter Vectors
Comments: 31 pages; the code for numerical computation of the hierarchical Bayes estimator in Section 4 has been corrected; Tables 2, 3, and 4 and the second-to-the-last paragraph of Section 4 have been changed
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
[13]  arXiv:2007.11078 (replaced) [pdf, other]
Title: The Complete Lasso Tradeoff Diagram
Comments: To appear in the 34th Conference on Neural Information Processing Systems (NeurIPS 2020)
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT)
[14]  arXiv:1605.09124 (replaced) [pdf, ps, other]
Title: Minimax Rate-Optimal Estimation of Divergences between Discrete Distributions
Comments: This version has been significantly revised
Subjects: Information Theory (cs.IT); Statistics Theory (math.ST)
[15]  arXiv:2006.08172 (replaced) [pdf, other]
Title: Faster Wasserstein Distance Estimation with the Sinkhorn Divergence
Authors: Lenaic Chizat (LMO), Pierre Roussillon (DMA), Flavien Léger (DMA), François-Xavier Vialard (Univ Gustave Eiffel), Gabriel Peyré (DMA)
Journal-ref: Neural Information Processing Systems, Dec 2020, Vancouver, Canada
Subjects: Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
[16]  arXiv:2010.10436 (replaced) [pdf, other]
Title: VarGrad: A Low-Variance Gradient Estimator for Variational Inference
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)