New submissions

[ total of 27 entries: 1-27 ]

New submissions for Fri, 18 Jun 21

[1]  arXiv:2106.09071 [pdf, ps, other]
Title: Pre-processing with Orthogonal Decompositions for High-dimensional Explanatory Variables
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)

Strong correlations between explanatory variables are problematic for high-dimensional regularized regression methods. Due to the violation of the Irrepresentable Condition, the popular LASSO method may suffer from false inclusions of inactive variables. In this paper, we propose pre-processing with orthogonal decompositions (PROD) for the explanatory variables in high-dimensional regressions. The PROD procedure is constructed based upon a generic orthogonal decomposition of the design matrix. We demonstrate by two concrete cases that the PROD approach can be effectively constructed for improving the performance of high-dimensional penalized regression. Our theoretical analysis reveals their properties and benefits for high-dimensional penalized linear regression with LASSO. Extensive numerical studies with simulations and data analysis show the promising performance of the PROD.

[2]  arXiv:2106.09100 [pdf, other]
Title: Maximum likelihood estimation for mechanistic network models
Comments: 29 pages, 8 figures
Subjects: Methodology (stat.ME)

Mechanistic network models specify the mechanisms by which networks grow and change, allowing researchers to investigate complex systems using both simulation and analytical techniques. Unfortunately, it is difficult to write likelihoods for instances of graphs generated with mechanistic models because of a combinatorial explosion in outcomes of repeated applications of the mechanism. Thus it is near impossible to estimate the parameters using maximum likelihood estimation. In this paper, we propose treating node sequence in a growing network model as an additional parameter, or as a missing random variable, and maximizing over the resulting likelihood. We develop this framework in the context of a simple mechanistic network model, used to study gene duplication and divergence, and test a variety of algorithms for maximizing the likelihood in simulated graphs. We also run the best-performing algorithm on a human protein-protein interaction network and four non-human protein-protein interaction networks. Although we focus on a specific mechanistic network model here, the proposed framework is more generally applicable to reversible models.

[3]  arXiv:2106.09114 [pdf, other]
Title: Semiparametric count data regression for self-reported mental health
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)

"For how many days during the past 30 days was your mental health not good?" The responses to this question measure self-reported mental health and can be linked to important covariates in the National Health and Nutrition Examination Survey (NHANES). However, these count variables present major distributional challenges: the data are overdispersed, zero-inflated, bounded by 30, and heaped in five- and seven-day increments. To meet these challenges, we design a semiparametric estimation and inference framework for count data regression. The data-generating process is defined by simultaneously transforming and rounding (STAR) a latent Gaussian regression model. The transformation is estimated nonparametrically and the rounding operator ensures the correct support for the discrete and bounded data. Maximum likelihood estimators are computed using an EM algorithm that is compatible with any continuous data model estimable by least squares. STAR regression includes asymptotic hypothesis testing and confidence intervals, variable selection via information criteria, and customized diagnostics. Simulation studies validate the utility of this framework. STAR is deployed to study the factors associated with self-reported mental health and demonstrates substantial improvements in goodness-of-fit compared to existing count data regression models.

[4]  arXiv:2106.09115 [pdf, other]
Title: Clustering inference in multiple groups
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)

Inference in clustering is paramount to uncovering inherent group structure in data. Clustering methods that assess statistical significance have recently drawn attention owing to their importance for identifying patterns in high-dimensional data, with applications in many scientific fields. We present here a U-statistics based approach, specially tailored for high-dimensional data, that clusters the data into three groups while assessing the significance of such partitions. Because our approach builds on the U-statistics based clustering framework of the methods in the R package uclust, it inherits their characteristics: it is a non-parametric method relying on very few assumptions about the data, and thus can be applied to a wide range of datasets. Furthermore, our method aims to be a more powerful tool for finding the best partitions of the data into three groups when that particular structure is present. To do so, we first propose an extension of the test U-statistic and develop its asymptotic theory. Additionally, we propose a ternary non-nested significance clustering method. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. Applications to peripheral blood mononuclear cells and to image recognition show the versatility of our proposal, which presents superior performance when compared with other approaches.

[5]  arXiv:2106.09494 [pdf, other]
Title: Optimum Allocation for Adaptive Multi-Wave Sampling in R: The R Package optimall
Comments: 31 pages, 7 figures
Subjects: Methodology (stat.ME)

The R package optimall offers a collection of functions that efficiently streamline the design process of sampling in surveys ranging from simple to complex. The package's main functions allow users to interactively define and adjust strata cut points based on values or quantiles of auxiliary covariates, adaptively calculate the optimum number of samples to allocate to each stratum using Neyman or Wright allocation, and select specific IDs to sample based on a stratified sampling design. Using real-life epidemiological study examples, we demonstrate how optimall facilitates an efficient workflow for the design and implementation of surveys in R. Although tailored towards multi-wave sampling under two- or three-phase designs, the R package optimall may be useful for any sampling survey.
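
As a rough illustration of the Neyman allocation mentioned in the abstract above, the sketch below splits a total sample size across strata in proportion to N_h * S_h (stratum size times stratum standard deviation). It is written in Python rather than R and does not use the optimall API; the function name and the example numbers are hypothetical, and the result is left real-valued, without the integer rounding that an exact method such as Wright allocation would provide.

```python
def neyman_allocation(stratum_sizes, stratum_sds, n_total):
    """Real-valued Neyman allocation: n_h = n_total * N_h * S_h / sum_k(N_k * S_k).

    Hypothetical helper for illustration only; it is not part of optimall.
    """
    if len(stratum_sizes) != len(stratum_sds):
        raise ValueError("stratum_sizes and stratum_sds must have the same length")
    # Weight of each stratum: population size times within-stratum standard deviation.
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one stratum needs a positive size and standard deviation")
    return [n_total * w / total_weight for w in weights]


# Two strata: the larger, more variable stratum receives most of the sample.
allocation = neyman_allocation([100, 200], [1.0, 2.0], 50)
print(allocation)  # [10.0, 40.0]
```

In practice the fractional values would still need to be rounded to integers that sum to the total sample size, which is the step exact allocation methods handle.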

[6]  arXiv:2106.09499 [pdf, other]
Title: Maximum Entropy Spectral Analysis: a case study
Comments: 16 pages, 13 figures, submitted to A&A
Subjects: Methodology (stat.ME); Instrumentation and Methods for Astrophysics (astro-ph.IM); Data Analysis, Statistics and Probability (physics.data-an)

The Maximum Entropy Spectral Analysis (MESA) method, developed by Burg, provides a powerful tool to perform spectral estimation of a time series. The method relies on Jaynes' maximum entropy principle and provides the means of inferring the spectrum of a stochastic process in terms of the coefficients of some autoregressive process AR($p$) of order $p$. A closed-form recursive solution provides an estimate of the autoregressive coefficients as well as of the order $p$ of the process. We provide a ready-to-use implementation of the algorithm in the form of a Python package, memspectrum. We characterize our implementation by performing a power spectral density analysis on synthetic data (with known power spectral density) and we compare different criteria for stopping the recursion. Furthermore, we compare the performance of our code with the ubiquitous Welch algorithm, using synthetic data generated from the spectrum released by the LIGO-Virgo collaboration. We find that, when compared to Welch's method, Burg's method provides a power spectral density (PSD) estimation with a systematically lower variance and bias. This is particularly manifest when the number of data points is small, making Burg's method the most suitable for work in this regime.

[7]  arXiv:2106.09632 [pdf, other]
Title: Large-Scale Multiple Testing for Matrix-Valued Data under Double Dependency
Subjects: Methodology (stat.ME); Applications (stat.AP)

High-dimensional inference based on matrix-valued data has drawn increasing attention in modern statistical research, yet not much progress has been made in large-scale multiple testing specifically designed for analysing such data sets. Motivated by this, we consider in this article an electroencephalography (EEG) experiment that produces matrix-valued data and offers scope for developing novel multiple testing methods for matrix-valued data that control false discoveries for hypotheses that are of importance in such an experiment. The row-column cross-dependency of observations appearing in a matrix form, referred to as double-dependency, is one of the main challenges in the development of such methods. We address it by assuming a matrix normal distribution for the observations at each of the independent matrix data points. This allows us to fully capture the underlying double-dependency informed through the row- and column-covariance matrices and develop methods that are potentially more powerful than the corresponding one (e.g., Fan and Han (2017)) obtained by vectorizing each data point and thus ignoring the double-dependency. We propose two methods to approximate the false discovery proportion with statistical accuracy. While one of these methods is a general approach under double-dependency, the other one provides more computational efficiency for higher dimensionality. Extensive numerical studies illustrate the superior performance of the proposed methods over the principal factor approximation method of Fan and Han (2017). The proposed methods have been further applied to the aforementioned EEG data.

[8]  arXiv:2106.09633 [pdf, ps, other]
Title: Optimal Relevant Subset Designs in Nonlinear Models
Authors: Adam Lane
Comments: 25 pages, 6 figures, 1 table
Subjects: Methodology (stat.ME)

Fisher (1934) argued that certain ancillary statistics form a relevant subset, a subset of the sample space on which inference should be restricted, and showed that conditioning on their observed value reduces the dimension of the data without a loss of information. The use of ancillary statistics in post-data inference has received significant attention; however, their role in the design of the experiment has not been well characterized. Ancillary statistics are unknown prior to data collection and as a result cannot be incorporated into the design a priori. However, if the data are observed sequentially then the ancillary statistics based on the data from the preceding observations can be used to determine the design assignment for the current observation. The main results of this work describe the benefits of incorporating ancillary statistics, specifically, the ancillary statistic that constitutes a relevant subset, into an adaptive design.

[9]  arXiv:2106.09702 [pdf, other]
Title: Spectral goodness-of-fit tests for complete and partial network data
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Applications (stat.AP); Machine Learning (stat.ML)

Networks describe the often complex relationships between individual actors. In this work, we address the question of how to determine whether a parametric model, such as a stochastic block model or latent space model, fits a dataset well and will extrapolate to similar data. We use recent results in random matrix theory to derive a general goodness-of-fit test for dyadic data. We show that our method, when applied to a specific model of interest, provides a straightforward, computationally fast way of selecting parameters in a number of commonly used network models. For example, we show how to select the dimension of the latent space in latent space models. Unlike other network goodness-of-fit methods, our general approach does not require simulating from a candidate parametric model, which can be cumbersome with large graphs, and eliminates the need to choose a particular set of statistics on the graph for comparison. It also allows us to perform goodness-of-fit tests on partial network data, such as Aggregated Relational Data. We show with simulations that our method performs well in many situations of interest. We analyze several empirically relevant networks and show that our method leads to improved community detection algorithms. R code to implement our method is available on GitHub.

Cross-lists for Fri, 18 Jun 21

[10]  arXiv:2106.09327 (cross-list from eess.SP) [pdf, other]
Title: Minimax Estimation of Partially-Observed Vector AutoRegressions
Authors: Guillaume Dalle (CERMICS), Yohann de Castro (ICJ, ECL)
Subjects: Signal Processing (eess.SP); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)

To understand the behavior of large dynamical systems like transportation networks, one must often rely on measurements transmitted by a set of sensors, for instance individual vehicles. Such measurements are likely to be incomplete and imprecise, which makes it hard to recover the underlying signal of interest. Hoping to quantify this phenomenon, we study the properties of a partially-observed state-space model. In our setting, the latent state $X$ follows a high-dimensional Vector AutoRegressive process $X_t = \theta X_{t-1} + \varepsilon_t$. Meanwhile, the observations $Y$ are given by a noise-corrupted random sample from the state $Y_t = \Pi_t X_t + \eta_t$. Several random sampling mechanisms are studied, allowing us to investigate the effect of spatial and temporal correlations in the distribution of the sampling matrices $\Pi_t$. We first prove a lower bound on the minimax estimation error for the transition matrix $\theta$. We then describe a sparse estimator based on the Dantzig selector and upper bound its non-asymptotic error, showing that it achieves the optimal convergence rate for most of our sampling mechanisms. Numerical experiments on simulated time series validate our theoretical findings, while an application to open railway data highlights the relevance of this model for public transport traffic analysis.
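
To make the state-space model in the abstract above concrete, the following Python sketch simulates $X_t = \theta X_{t-1} + \varepsilon_t$ and $Y_t = \Pi_t X_t + \eta_t$ using only the standard library. Several choices are assumptions made for illustration and are not taken from the paper: the sampling matrix $\Pi_t$ is a diagonal 0/1 matrix with independent Bernoulli entries, both noise terms are Gaussian, the state starts at zero, and all function and parameter names are hypothetical. It only generates data and does not implement the Dantzig-selector estimator.

```python
import random


def simulate_partially_observed_var(theta, steps, p_observe,
                                    state_sd=1.0, obs_sd=0.1, seed=0):
    """Simulate X_t = theta X_{t-1} + eps_t and Y_t = Pi_t X_t + eta_t.

    Assumption: Pi_t is diagonal with independent Bernoulli(p_observe) entries.
    """
    rng = random.Random(seed)
    d = len(theta)
    x = [0.0] * d  # assumed initial state X_0 = 0
    states, observations = [], []
    for _ in range(steps):
        # State update: matrix-vector product plus Gaussian innovation.
        x = [sum(theta[i][j] * x[j] for j in range(d)) + rng.gauss(0.0, state_sd)
             for i in range(d)]
        # Diagonal of Pi_t: 1 if coordinate i is sampled at this step, else 0.
        mask = [1 if rng.random() < p_observe else 0 for _ in range(d)]
        # Observation: masked state plus Gaussian measurement noise.
        y = [mask[i] * x[i] + rng.gauss(0.0, obs_sd) for i in range(d)]
        states.append(x)
        observations.append(y)
    return states, observations


theta = [[0.5, 0.1], [0.0, 0.3]]
states, observations = simulate_partially_observed_var(theta, steps=100, p_observe=0.5)
```

Under this assumed mechanism each coordinate is observed independently at each step, so it corresponds to sampling without spatial or temporal correlation; the correlated mechanisms studied in the paper would need a different construction of the mask.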

[11]  arXiv:2106.09387 (cross-list from math.ST) [pdf, other]
Title: Taming Nonconvexity in Kernel Feature Selection---Favorable Properties of the Laplace Kernel
Comments: 33 pages main text
Subjects: Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)

Kernel-based feature selection is an important tool in nonparametric statistics. Despite many practical applications of kernel-based feature selection, there is little statistical theory available to support the method. A core challenge is that the objective functions of the optimization problems used to define kernel-based feature selection are nonconvex. The literature has only studied the statistical properties of the global optima, which is a mismatch, given that the gradient-based algorithms available for nonconvex optimization are only able to guarantee convergence to local minima. Studying the full landscape associated with kernel-based methods, we show that feature selection objectives using the Laplace kernel (and other $\ell_1$ kernels) come with statistical guarantees that other kernels, including the ubiquitous Gaussian kernel (or other $\ell_2$ kernels), do not possess. Based on a sharp characterization of the gradient of the objective function, we show that $\ell_1$ kernels eliminate unfavorable stationary points that appear when using an $\ell_2$ kernel. Armed with this insight, we establish statistical guarantees for $\ell_1$ kernel-based feature selection which do not require reaching the global minima. In particular, we establish model-selection consistency of $\ell_1$-kernel-based feature selection in recovering main effects and hierarchical interactions in the nonparametric setting with $n \sim \log p$ samples.

[12]  arXiv:2106.09533 (cross-list from cs.IR) [pdf, other]
Title: Author Clustering and Topic Estimation for Short Texts
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

Analysis of short text, such as social media posts, is extremely difficult because it relies on observing many document-level word co-occurrence pairs. Beyond topic distributions, a common downstream task of the modeling is grouping the authors of these documents for subsequent analyses. Traditional models estimate the document groupings and identify user clusters with an independent procedure. We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document, with user-level topic distributions. We also simultaneously cluster users, removing the need for post-hoc cluster estimation and improving topic estimation by shrinking noisy user-level topic distributions towards typical values. Our method performs as well as, or better than, traditional approaches to problems arising in short text, and we demonstrate its usefulness on a dataset of tweets from United States Senators, recovering both meaningful topics and clusters that reflect partisan ideology.

[13]  arXiv:2106.09597 (cross-list from stat.AP) [pdf, other]
Title: Hierarchical surrogate-based Approximate Bayesian Computation for an electric motor test bench
Subjects: Applications (stat.AP); Optimization and Control (math.OC); Data Analysis, Statistics and Probability (physics.data-an); Methodology (stat.ME)

Inferring parameter distributions of complex industrial systems from noisy time series data requires methods to deal with the uncertainty of the underlying data and of the simulation model used. Bayesian inference is well suited for these uncertain inverse problems. Standard methods used to identify uncertain parameters are Markov Chain Monte Carlo (MCMC) methods with explicit evaluation of a likelihood function. However, if the likelihood is very complex, such that its evaluation is computationally expensive, or even unknown in its explicit form, Approximate Bayesian Computation (ABC) methods provide a promising alternative. In this work, both methods are applied first to artificially generated data and then to a real-world problem, using data from an electric motor test bench. We show that both methods are able to infer the distribution of varying parameters with a Bayesian hierarchical approach. However, the proposed ABC method is computationally much more efficient at achieving results of similar accuracy. We suggest using summary statistics to reduce the dimension of the data, which significantly increases the efficiency of the algorithm. Furthermore, the simulation model is replaced by a Polynomial Chaos Expansion (PCE) surrogate to speed up model evaluations. We prove consistency for the proposed surrogate-based ABC method with summary statistics under mild conditions on the (approximated) forward model.

Replacements for Fri, 18 Jun 21

[14]  arXiv:1607.06565 (replaced) [pdf, other]
Title: Estimating Causal Peer Influence in Homophilous Social Networks by Inferring Latent Locations
Comments: 35 pages, 4 figures
Subjects: Methodology (stat.ME); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
[15]  arXiv:2002.09736 (replaced) [pdf, other]
Title: Model-assisted estimation through random forests in finite population sampling
Subjects: Methodology (stat.ME)
[16]  arXiv:2003.09202 (replaced) [pdf, ps, other]
Title: New statistical model for misreported data with application to current public health challenges
Subjects: Methodology (stat.ME); Applications (stat.AP)
[17]  arXiv:2006.13489 (replaced) [pdf, other]
Title: Unified Principal Component Analysis for Sparse and Dense Functional Data under Spatial Dependency
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Statistics Theory (math.ST); Computation (stat.CO)
[18]  arXiv:2007.12807 (replaced) [pdf, other]
Title: Cross-study learning for generalist and specialist predictions
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
[19]  arXiv:2008.00163 (replaced) [pdf, other]
Title: The Importance of Being Correlated: Implications of Dependence in Joint Spectral Inference across Multiple Networks
Comments: 44 pages, 13 figures
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
[20]  arXiv:2010.12696 (replaced) [pdf, other]
Title: On Construction and Estimation of Stationary Mixture Transition Distribution Models
Subjects: Methodology (stat.ME)
[21]  arXiv:2103.16810 (replaced) [pdf, other]
Title: An Expectation-Maximization Algorithm for Continuous-time Hidden Markov Models
Subjects: Methodology (stat.ME)
[22]  arXiv:2104.13588 (replaced) [pdf]
Title: Improved log-Gaussian approximation for over-dispersed Poisson regression: application to spatial analysis of COVID-19
Subjects: Methodology (stat.ME); Applications (stat.AP)
[23]  arXiv:2106.06669 (replaced) [pdf, other]
Title: Spatial Bayesian GLM on the cortical surface produces reliable task activations in individuals and groups
Comments: 37 pages, 24 figures
Subjects: Methodology (stat.ME)
[24]  arXiv:1803.03348 (replaced) [pdf, other]
Title: Joint Estimation and Inference for Data Integration Problems based on Multiple Multi-layered Gaussian Graphical Models
Subjects: Machine Learning (stat.ML); Statistics Theory (math.ST); Methodology (stat.ME)
[25]  arXiv:1903.08008 (replaced) [pdf, other]
Title: Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC
Comments: Two small fixes. Published in Bayesian analysis this https URL
Subjects: Computation (stat.CO); Methodology (stat.ME)
[26]  arXiv:2007.06357 (replaced) [pdf, other]
Title: Feasible Inference for Stochastic Volatility in Brownian Semistationary Processes
Comments: 21 pages, 7 figures
Subjects: Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)
[27]  arXiv:2105.10360 (replaced) [pdf, other]
Title: BELT: Block-wise Missing Embedding Learning Transformer
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)