Methodology
New submissions
New submissions for Fri, 18 Jun 21
 [1] arXiv:2106.09071 [pdf, ps, other]

Title: Preprocessing with Orthogonal Decompositions for High-dimensional Explanatory Variables
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
Strong correlations between explanatory variables are problematic for high-dimensional regularized regression methods. Due to the violation of the Irrepresentable Condition, the popular LASSO method may suffer from false inclusions of inactive variables. In this paper, we propose preprocessing with orthogonal decompositions (PROD) for the explanatory variables in high-dimensional regressions. The PROD procedure is constructed based upon a generic orthogonal decomposition of the design matrix. We demonstrate by two concrete cases that the PROD approach can be effectively constructed for improving the performance of high-dimensional penalized regression. Our theoretical analysis reveals their properties and benefits for high-dimensional penalized linear regression with LASSO. Extensive numerical studies with simulations and data analysis show the promising performance of the PROD.
 [2] arXiv:2106.09100 [pdf, other]

Title: Maximum likelihood estimation for mechanistic network models
Comments: 29 pages, 8 figures
Subjects: Methodology (stat.ME)
Mechanistic network models specify the mechanisms by which networks grow and change, allowing researchers to investigate complex systems using both simulation and analytical techniques. Unfortunately, it is difficult to write likelihoods for instances of graphs generated with mechanistic models because of a combinatorial explosion in outcomes of repeated applications of the mechanism. Thus it is near impossible to estimate the parameters using maximum likelihood estimation. In this paper, we propose treating node sequence in a growing network model as an additional parameter, or as a missing random variable, and maximizing over the resulting likelihood. We develop this framework in the context of a simple mechanistic network model, used to study gene duplication and divergence, and test a variety of algorithms for maximizing the likelihood in simulated graphs. We also run the best-performing algorithm on a human protein-protein interaction network and four non-human protein-protein interaction networks. Although we focus on a specific mechanistic network model here, the proposed framework is more generally applicable to reversible models.
 [3] arXiv:2106.09114 [pdf, other]

Title: Semiparametric count data regression for self-reported mental health
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)
"For how many days during the past 30 days was your mental health not good?" The responses to this question measure self-reported mental health and can be linked to important covariates in the National Health and Nutrition Examination Survey (NHANES). However, these count variables present major distributional challenges: the data are overdispersed, zero-inflated, bounded by 30, and heaped in five- and seven-day increments. To meet these challenges, we design a semiparametric estimation and inference framework for count data regression. The data-generating process is defined by simultaneously transforming and rounding (STAR) a latent Gaussian regression model. The transformation is estimated nonparametrically and the rounding operator ensures the correct support for the discrete and bounded data. Maximum likelihood estimators are computed using an EM algorithm that is compatible with any continuous data model estimable by least squares. STAR regression includes asymptotic hypothesis testing and confidence intervals, variable selection via information criteria, and customized diagnostics. Simulation studies validate the utility of this framework. STAR is deployed to study the factors associated with self-reported mental health and demonstrates substantial improvements in goodness-of-fit compared to existing count data regression models.
 [4] arXiv:2106.09115 [pdf, other]

Title: Clustering inference in multiple groups
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Inference in clustering is paramount to uncovering inherent group structure in data. Clustering methods which assess statistical significance have recently drawn attention owing to their importance for the identification of patterns in high-dimensional data with applications in many scientific fields. We present here a U-statistics based approach, specially tailored for high-dimensional data, that clusters the data into three groups while assessing the significance of such partitions. Because our approach stands on the U-statistics based clustering framework of the methods in the R package uclust, it inherits that framework's characteristics: it is a nonparametric method relying on very few assumptions about the data, and thus can be applied to a wide range of datasets. Furthermore, our method aims to be a more powerful tool to find the best partitions of the data into three groups when that particular structure is present. In order to do so, we first propose an extension of the test U-statistic and develop its asymptotic theory. Additionally, we propose a ternary non-nested significance clustering method. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. Applications to peripheral blood mononuclear cells and to image recognition show the versatility of our proposal, which outperforms the other approaches it is compared with.
 [5] arXiv:2106.09494 [pdf, other]

Title: Optimum Allocation for Adaptive Multi-Wave Sampling in R: The R Package optimall
Comments: 31 pages, 7 figures
Subjects: Methodology (stat.ME)
The R package optimall offers a collection of functions that efficiently streamline the design process of sampling in surveys ranging from simple to complex. The package's main functions allow users to interactively define and adjust strata cut points based on values or quantiles of auxiliary covariates, adaptively calculate the optimum number of samples to allocate to each stratum using Neyman or Wright allocation, and select specific IDs to sample based on a stratified sampling design. Using real-life epidemiological study examples, we demonstrate how optimall facilitates an efficient workflow for the design and implementation of surveys in R. Although tailored towards multi-wave sampling under two- or three-phase designs, the R package optimall may be useful for any sampling survey.
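As an illustration of the Neyman allocation named in the abstract, the textbook rule gives stratum h a share of the total sample proportional to N_h * S_h, where N_h is the stratum size and S_h the within-stratum standard deviation. The sketch below is plain Python and is not the optimall API; the function name, its arguments, and the example numbers are hypothetical, and integer rounding of the result is left out.

```python
from typing import Sequence


def neyman_allocation(
    total_n: int,
    stratum_sizes: Sequence[float],
    stratum_sds: Sequence[float],
) -> list[float]:
    """Return the (unrounded) Neyman sample size for each stratum.

    n_h = total_n * N_h * S_h / sum_k(N_k * S_k)
    """
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("at least one stratum needs positive size and spread")
    return [total_n * w / weight_sum for w in weights]


# Hypothetical example: three strata, 100 samples to allocate.
allocation = neyman_allocation(100, [500, 300, 200], [2.0, 5.0, 10.0])
```

In this toy example the smallest stratum receives the largest share because its variability is highest, which is the behaviour the rule is designed to produce.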
 [6] arXiv:2106.09499 [pdf, other]

Title: Maximum Entropy Spectral Analysis: a case study
Comments: 16 pages, 13 figures, submitted to A&A
Subjects: Methodology (stat.ME); Instrumentation and Methods for Astrophysics (astro-ph.IM); Data Analysis, Statistics and Probability (physics.data-an)
The Maximum Entropy Spectral Analysis (MESA) method, developed by Burg, provides a powerful tool to perform spectral estimation of a time series. The method relies on Jaynes' maximum entropy principle and provides the means of inferring the spectrum of a stochastic process in terms of the coefficients of some autoregressive process AR($p$) of order $p$. A closed-form recursive solution provides an estimate of the autoregressive coefficients as well as of the order $p$ of the process. We provide a ready-to-use implementation of the algorithm in the form of a Python package \texttt{memspectrum}. We characterize our implementation by performing a power spectral density analysis on synthetic data (with known power spectral density) and we compare different criteria for stopping the recursion. Furthermore, we compare the performance of our code with the ubiquitous Welch algorithm, using synthetic data generated from the released spectrum by the LIGO-Virgo collaboration. We find that, when compared to Welch's method, Burg's method provides a power spectral density (PSD) estimation with a systematically lower variance and bias. This is particularly manifest in the case of a small number of data points, making Burg's method most suitable to work in this regime.
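To make the link between autoregressive coefficients and a spectrum concrete, here is a minimal sketch of the standard PSD formula for an AR($p$) process. It does not implement Burg's recursion and does not use the memspectrum package; the function name and the sign convention x_t = sum_k a_k x_{t-k} + e_t are assumptions made for this example.

```python
import numpy as np


def ar_power_spectral_density(ar_coeffs, noise_variance, freqs, dt=1.0):
    """Two-sided PSD of x_t = sum_k a_k x_{t-k} + e_t, Var(e_t) = noise_variance.

    S(f) = noise_variance * dt / |1 - sum_k a_k exp(-2j*pi*f*k*dt)|^2
    """
    ar_coeffs = np.asarray(ar_coeffs, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    lags = np.arange(1, ar_coeffs.size + 1)
    # One row per frequency, one column per lag.
    phases = np.exp(-2j * np.pi * np.outer(freqs, lags) * dt)
    transfer = 1.0 - phases @ ar_coeffs
    return noise_variance * dt / np.abs(transfer) ** 2


freqs = np.linspace(0.0, 0.5, 6)
white_psd = ar_power_spectral_density([], 2.0, freqs)     # AR(0): flat spectrum
ar1_psd = ar_power_spectral_density([0.9], 1.0, freqs)    # AR(1): power at low f
```

Once coefficients have been estimated by any method, a formula of this kind turns them into a smooth spectral estimate, which is what distinguishes parametric estimators from periodogram averaging.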
 [7] arXiv:2106.09632 [pdf, other]

Title: Large-Scale Multiple Testing for Matrix-Valued Data under Double Dependency
Subjects: Methodology (stat.ME); Applications (stat.AP)
High-dimensional inference based on matrix-valued data has drawn increasing attention in modern statistical research, yet not much progress has been made in large-scale multiple testing specifically designed for analysing such data sets. Motivated by this, we consider in this article an electroencephalography (EEG) experiment that produces matrix-valued data and offers scope for developing novel multiple testing methods, based on matrix-valued data, that control false discoveries for hypotheses that are of importance in such an experiment. The row-column cross-dependency of observations appearing in a matrix form, referred to as double-dependency, is one of the main challenges in the development of such methods. We address it by assuming a matrix normal distribution for the observations at each of the independent matrix data points. This allows us to fully capture the underlying double-dependency informed through the row- and column-covariance matrices and develop methods that are potentially more powerful than the corresponding one (e.g., Fan and Han (2017)) obtained by vectorizing each data point and thus ignoring the double-dependency. We propose two methods to approximate the false discovery proportion with statistical accuracy. While one of these methods is a general approach under double-dependency, the other one provides more computational efficiency for higher dimensionality. Extensive numerical studies illustrate the superior performance of the proposed methods over the principal factor approximation method of Fan and Han (2017). The proposed methods have been further applied to the aforementioned EEG data.
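The matrix normal assumption can be illustrated with a short sampling sketch: a draw is built as mean + A Z B^T, where A A^T is the row covariance, B B^T is the column covariance, and Z has independent standard normal entries. This only illustrates the distribution, not the testing procedures of the paper; the function name and the example covariance matrices are hypothetical.

```python
import numpy as np


def sample_matrix_normal(mean, row_cov, col_cov, rng):
    """Draw one matrix X ~ MN(mean, row_cov, col_cov).

    Uses X = mean + A Z B^T with A A^T = row_cov, B B^T = col_cov and
    Z filled with independent standard normal entries.
    """
    row_factor = np.linalg.cholesky(row_cov)
    col_factor = np.linalg.cholesky(col_cov)
    noise = rng.standard_normal(mean.shape)
    return mean + row_factor @ noise @ col_factor.T


rng = np.random.default_rng(0)
row_cov = np.array([[1.0, 0.5], [0.5, 1.0]])             # dependence across rows
col_cov = np.array([[1.0, 0.3, 0.0],
                    [0.3, 1.0, 0.3],
                    [0.0, 0.3, 1.0]])                    # dependence across columns
draws = np.stack([sample_matrix_normal(np.zeros((2, 3)), row_cov, col_cov, rng)
                  for _ in range(20000)])
# E[X X^T] = row_cov * trace(col_cov); trace(col_cov) is 3 here, so dividing
# by 3 and by the number of draws gives an estimate of row_cov.
empirical_row_cov = np.einsum("nik,njk->ij", draws, draws) / (draws.shape[0] * 3)
```

Vectorizing each draw would hide this Kronecker-type structure inside one large covariance matrix, which is the loss of information the abstract refers to.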
 [8] arXiv:2106.09633 [pdf, ps, other]

Title: Optimal Relevant Subset Designs in Nonlinear Models
Authors: Adam Lane
Comments: 25 pages, 6 figures, 1 table
Subjects: Methodology (stat.ME)
Fisher (1934) argued that certain ancillary statistics form a relevant subset, a subset of the sample space on which inference should be restricted, and showed that conditioning on their observed value reduces the dimension of the data without a loss of information. The use of ancillary statistics in post-data inference has received significant attention; however, their role in the design of the experiment has not been well characterized. Ancillary statistics are unknown prior to data collection and as a result cannot be incorporated into the design a priori. However, if the data are observed sequentially then the ancillary statistics based on the data from the preceding observations can be used to determine the design assignment for the current observation. The main results of this work describe the benefits of incorporating ancillary statistics, specifically, the ancillary statistic that constitutes a relevant subset, into an adaptive design.
 [9] arXiv:2106.09702 [pdf, other]

Title: Spectral goodness-of-fit tests for complete and partial network data
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Applications (stat.AP); Machine Learning (stat.ML)
Networks describe the often complex relationships between individual actors. In this work, we address the question of how to determine whether a parametric model, such as a stochastic block model or latent space model, fits a dataset well and will extrapolate to similar data. We use recent results in random matrix theory to derive a general goodness-of-fit test for dyadic data. We show that our method, when applied to a specific model of interest, provides a straightforward, computationally fast way of selecting parameters in a number of commonly used network models. For example, we show how to select the dimension of the latent space in latent space models. Unlike other network goodness-of-fit methods, our general approach does not require simulating from a candidate parametric model, which can be cumbersome with large graphs, and eliminates the need to choose a particular set of statistics on the graph for comparison. It also allows us to perform goodness-of-fit tests on partial network data, such as Aggregated Relational Data. We show with simulations that our method performs well in many situations of interest. We analyze several empirically relevant networks and show that our method leads to improved community detection algorithms. R code to implement our method is available on GitHub.
Cross-lists for Fri, 18 Jun 21
 [10] arXiv:2106.09327 (cross-list from eess.SP) [pdf, other]

Title: Minimax Estimation of Partially-Observed Vector AutoRegressions
Subjects: Signal Processing (eess.SP); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
To understand the behavior of large dynamical systems like transportation networks, one must often rely on measurements transmitted by a set of sensors, for instance individual vehicles. Such measurements are likely to be incomplete and imprecise, which makes it hard to recover the underlying signal of interest. Hoping to quantify this phenomenon, we study the properties of a partially-observed state-space model. In our setting, the latent state $X$ follows a high-dimensional Vector AutoRegressive process $X_t = \theta X_{t-1} + \varepsilon_t$. Meanwhile, the observations $Y$ are given by a noise-corrupted random sample from the state $Y_t = \Pi_t X_t + \eta_t$. Several random sampling mechanisms are studied, allowing us to investigate the effect of spatial and temporal correlations in the distribution of the sampling matrices $\Pi_t$. We first prove a lower bound on the minimax estimation error for the transition matrix $\theta$. We then describe a sparse estimator based on the Dantzig selector and upper bound its non-asymptotic error, showing that it achieves the optimal convergence rate for most of our sampling mechanisms. Numerical experiments on simulated time series validate our theoretical findings, while an application to open railway data highlights the relevance of this model for public transport traffic analysis.
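The two model equations in the abstract can be simulated directly. The sketch below assumes the simplest sampling mechanism, in which each $\Pi_t$ is diagonal with independent Bernoulli entries; that choice, the zero initial state, the function name, and the example transition matrix are assumptions for illustration, and no estimator is implemented.

```python
import numpy as np


def simulate_partially_observed_var(theta, n_steps, sampling_prob,
                                    state_noise_sd, obs_noise_sd, rng):
    """Simulate X_t = theta X_{t-1} + eps_t and Y_t = Pi_t X_t + eta_t.

    Assumption: Pi_t is diagonal with independent Bernoulli(sampling_prob)
    entries, so each coordinate is kept or zeroed out independently.
    """
    dim = theta.shape[0]
    states = np.zeros((n_steps, dim))
    observations = np.zeros((n_steps, dim))
    masks = rng.random((n_steps, dim)) < sampling_prob   # diagonals of Pi_t
    previous = np.zeros(dim)                              # X_0 = 0 by assumption
    for t in range(n_steps):
        states[t] = theta @ previous + state_noise_sd * rng.standard_normal(dim)
        noise = obs_noise_sd * rng.standard_normal(dim)
        observations[t] = masks[t] * states[t] + noise
        previous = states[t]
    return states, observations, masks


rng = np.random.default_rng(1)
theta = np.array([[0.5, 0.2, 0.0],
                  [0.0, 0.5, 0.2],
                  [0.0, 0.0, 0.5]])   # sparse and stable: all eigenvalues are 0.5
states, observations, masks = simulate_partially_observed_var(
    theta, n_steps=200, sampling_prob=0.7, state_noise_sd=1.0,
    obs_noise_sd=0.1, rng=rng)
```

Replacing the independent mask with one that is correlated across coordinates or across time would be one way to mimic the spatially and temporally correlated sampling mechanisms the abstract mentions.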
 [11] arXiv:2106.09387 (cross-list from math.ST) [pdf, other]

Title: Taming Nonconvexity in Kernel Feature Selection: Favorable Properties of the Laplace Kernel
Comments: 33 pages main text
Subjects: Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
Kernel-based feature selection is an important tool in nonparametric statistics. Despite many practical applications of kernel-based feature selection, there is little statistical theory available to support the method. A core challenge is that the objective functions of the optimization problems used to define kernel-based feature selection are nonconvex. The literature has only studied the statistical properties of the \emph{global optima}, which is a mismatch, given that the gradient-based algorithms available for nonconvex optimization are only able to guarantee convergence to local minima. Studying the full landscape associated with kernel-based methods, we show that feature selection objectives using the Laplace kernel (and other $\ell_1$ kernels) come with statistical guarantees that other kernels, including the ubiquitous Gaussian kernel (or other $\ell_2$ kernels), do not possess. Based on a sharp characterization of the gradient of the objective function, we show that $\ell_1$ kernels eliminate unfavorable stationary points that appear when using an $\ell_2$ kernel. Armed with this insight, we establish statistical guarantees for $\ell_1$ kernel-based feature selection which do not require reaching the global minima. In particular, we establish model-selection consistency of $\ell_1$ kernel-based feature selection in recovering main effects and hierarchical interactions in the nonparametric setting with $n \sim \log p$ samples.
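For readers unfamiliar with the two kernel families, the sketch below writes out a Laplace ($\ell_1$) and a Gaussian ($\ell_2$) kernel with per-feature weights. The weighted parameterization, in which a zero weight removes a feature, is an assumption made for illustration and is not taken from the abstract; the function names and example vectors are hypothetical.

```python
import numpy as np


def laplace_kernel(x, y, weights):
    """l1-type kernel: exp(-sum_j weights_j * |x_j - y_j|)."""
    return np.exp(-np.sum(weights * np.abs(x - y)))


def gaussian_kernel(x, y, weights):
    """l2-type kernel: exp(-sum_j weights_j * (x_j - y_j)**2)."""
    return np.exp(-np.sum(weights * (x - y) ** 2))


x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.0, 0.0])
weights = np.array([1.0, 0.0, 0.0])   # only the first feature is "selected"

laplace_value = laplace_kernel(x, y, weights)     # depends on |1.0 - 1.5| only
gaussian_value = gaussian_kernel(x, y, weights)   # depends on (1.0 - 1.5)**2 only
```

The two kernels differ in how they behave near zero distance: the absolute value has a kink there while the square is flat, which affects the gradients of any objective built on them.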
 [12] arXiv:2106.09533 (cross-list from cs.IR) [pdf, other]

Title: Author Clustering and Topic Estimation for Short Texts
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Analysis of short text, such as social media posts, is extremely difficult because it relies on observing many document-level word co-occurrence pairs. Beyond topic distributions, a common downstream task of the modeling is grouping the authors of these documents for subsequent analyses. Traditional models estimate the document groupings and identify user clusters with an independent procedure. We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document, with user-level topic distributions. We also simultaneously cluster users, removing the need for post-hoc cluster estimation and improving topic estimation by shrinking noisy user-level topic distributions towards typical values. Our method performs as well as, or better than, traditional approaches to problems arising in short text, and we demonstrate its usefulness on a dataset of tweets from United States Senators, recovering both meaningful topics and clusters that reflect partisan ideology.
 [13] arXiv:2106.09597 (cross-list from stat.AP) [pdf, other]

Title: Hierarchical surrogate-based Approximate Bayesian Computation for an electric motor test bench
Subjects: Applications (stat.AP); Optimization and Control (math.OC); Data Analysis, Statistics and Probability (physics.data-an); Methodology (stat.ME)
Inferring parameter distributions of complex industrial systems from noisy time series data requires methods to deal with the uncertainty of the underlying data and the simulation model used. Bayesian inference is well suited for these uncertain inverse problems. Standard methods used to identify uncertain parameters are Markov Chain Monte Carlo (MCMC) methods with explicit evaluation of a likelihood function. However, if the likelihood is very complex, such that its evaluation is computationally expensive, or even unknown in its explicit form, Approximate Bayesian Computation (ABC) methods provide a promising alternative. In this work both methods are first applied to artificially generated data and then to a real-world problem, using data from an electric motor test bench. We show that both methods are able to infer the distribution of varying parameters with a Bayesian hierarchical approach. However, the proposed ABC method is computationally much more efficient at achieving results with similar accuracy. We suggest using summary statistics to reduce the dimension of the data, which significantly increases the efficiency of the algorithm. Further, the simulation model is replaced by a Polynomial Chaos Expansion (PCE) surrogate to speed up model evaluations. We prove consistency for the proposed surrogate-based ABC method with summary statistics under mild conditions on the (approximated) forward model.
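The role of summary statistics in ABC can be shown with the simplest variant, plain rejection ABC: draw a parameter from the prior, simulate data, and keep the draw if the simulated summary is close to the observed one. This is not the hierarchical, surrogate-based method of the paper; the toy Gaussian simulator, the flat prior, the tolerance, and all names below are assumptions for illustration.

```python
import numpy as np


def rejection_abc(observed, prior_sampler, simulator, summary, tolerance,
                  n_draws, rng):
    """Keep prior draws whose simulated summary lies within `tolerance`
    of the observed summary (plain rejection ABC)."""
    target = summary(observed)
    accepted = []
    for _ in range(n_draws):
        candidate = prior_sampler(rng)
        simulated = simulator(candidate, rng)
        if abs(summary(simulated) - target) <= tolerance:
            accepted.append(candidate)
    return np.array(accepted)


rng = np.random.default_rng(0)
observed = rng.normal(loc=2.0, scale=1.0, size=100)       # toy data, true mean 2.0
posterior_draws = rejection_abc(
    observed,
    prior_sampler=lambda r: r.uniform(-5.0, 5.0),          # hypothetical flat prior
    simulator=lambda mu, r: r.normal(loc=mu, scale=1.0, size=100),
    summary=np.mean,                                       # 100 values -> 1 number
    tolerance=0.1,
    n_draws=5000,
    rng=rng,
)
```

Comparing one-number summaries instead of full 100-point series is what keeps the acceptance rate usable here, and swapping the simulator for a cheap surrogate would reduce the cost of each draw further.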
Replacements for Fri, 18 Jun 21
 [14] arXiv:1607.06565 (replaced) [pdf, other]

Title: Estimating Causal Peer Influence in Homophilous Social Networks by Inferring Latent Locations
Comments: 35 pages, 4 figures
Subjects: Methodology (stat.ME); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
 [15] arXiv:2002.09736 (replaced) [pdf, other]

Title: Model-assisted estimation through random forests in finite population sampling
Subjects: Methodology (stat.ME)
 [16] arXiv:2003.09202 (replaced) [pdf, ps, other]

Title: New statistical model for misreported data with application to current public health challenges
Subjects: Methodology (stat.ME); Applications (stat.AP)
 [17] arXiv:2006.13489 (replaced) [pdf, other]

Title: Unified Principal Component Analysis for Sparse and Dense Functional Data under Spatial Dependency
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Statistics Theory (math.ST); Computation (stat.CO)
 [18] arXiv:2007.12807 (replaced) [pdf, other]

Title: Cross-study learning for generalist and specialist predictions
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
 [19] arXiv:2008.00163 (replaced) [pdf, other]

Title: The Importance of Being Correlated: Implications of Dependence in Joint Spectral Inference across Multiple Networks
Authors: Konstantinos Pantazis, Avanti Athreya, Jesús Arroyo, William N. Frost, Evan S. Hill, Vince Lyzinski
Comments: 44 pages, 13 figures
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
 [20] arXiv:2010.12696 (replaced) [pdf, other]

Title: On Construction and Estimation of Stationary Mixture Transition Distribution Models
Subjects: Methodology (stat.ME)
 [21] arXiv:2103.16810 (replaced) [pdf, other]

Title: An Expectation-Maximization Algorithm for Continuous-time Hidden Markov Models
Subjects: Methodology (stat.ME)
 [22] arXiv:2104.13588 (replaced) [pdf]

Title: Improved log-Gaussian approximation for over-dispersed Poisson regression: application to spatial analysis of COVID-19
Subjects: Methodology (stat.ME); Applications (stat.AP)
 [23] arXiv:2106.06669 (replaced) [pdf, other]

Title: Spatial Bayesian GLM on the cortical surface produces reliable task activations in individuals and groups
Comments: 37 pages, 24 figures
Subjects: Methodology (stat.ME)
 [24] arXiv:1803.03348 (replaced) [pdf, other]

Title: Joint Estimation and Inference for Data Integration Problems based on Multiple Multilayered Gaussian Graphical Models
Subjects: Machine Learning (stat.ML); Statistics Theory (math.ST); Methodology (stat.ME)
 [25] arXiv:1903.08008 (replaced) [pdf, other]

Title: Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC
Comments: Two small fixes. Published in Bayesian Analysis
Subjects: Computation (stat.CO); Methodology (stat.ME)
 [26] arXiv:2007.06357 (replaced) [pdf, other]

Title: Feasible Inference for Stochastic Volatility in Brownian Semistationary Processes
Comments: 21 pages, 7 figures
Subjects: Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)
 [27] arXiv:2105.10360 (replaced) [pdf, other]

Title: BELT: Blockwise Missing Embedding Learning Transformer
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)