Methodology

New submissions

[ total of 28 entries: 1-28 ]

New submissions for Mon, 27 Sep 21

[1]  arXiv:2109.11634 [pdf, other]
Title: Joint Estimation and Inference for Multi-Experiment Networks of High-Dimensional Point Processes
Authors: Xu Wang, Ali Shojaie
Comments: 49 pages, 9 figures
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)

Modern high-dimensional point process data, especially those from neuroscience experiments, often involve observations from multiple conditions and/or experiments. Networks of interactions corresponding to these conditions are expected to share many edges, but also exhibit unique, condition-specific ones. However, the degree of similarity among the networks from different conditions is generally unknown. Existing approaches for multivariate point processes do not take these structures into account and do not provide inference for jointly estimated networks. To address these needs, we propose a joint estimation procedure for networks of high-dimensional point processes that incorporates easy-to-compute weights in order to data-adaptively encourage similarity between the estimated networks. We also propose a powerful hierarchical multiple testing procedure for edges of all estimated networks, which takes into account the data-driven similarity structure of the multi-experiment networks. Compared to conventional multiple testing procedures, our proposed procedure greatly reduces the number of tests and results in improved power, while tightly controlling the family-wise error rate. Unlike existing procedures, our method is also free of assumptions on dependency between tests, offers flexibility on p-values calculated along the hierarchy, and is robust to misspecification of the hierarchical structure. We verify our theoretical results via simulation studies and demonstrate the application of the proposed procedure using neuronal spike train data.

[2]  arXiv:2109.11705 [pdf, other]
Title: Dimension-Grouped Mixed Membership Models for Multivariate Categorical Data
Subjects: Methodology (stat.ME)

Mixed Membership Models (MMMs) are a popular family of latent structure models for complex multivariate data. Instead of forcing each subject to belong to a single cluster, MMMs incorporate a vector of subject-specific weights characterizing partial membership across clusters. With this flexibility come challenges in uniquely identifying, estimating, and interpreting the parameters. In this article, we propose a new class of Dimension-Grouped MMMs (Gro-M$^3$s) for multivariate categorical data, which improve parsimony and interpretability. In Gro-M$^3$s, observed variables are partitioned into groups such that the latent membership is constant across variables within a group but can differ across groups. Traditional latent class models are obtained when all variables are in one group, while traditional MMMs are obtained when each variable is in its own group. The new model corresponds to a novel decomposition of probability tensors. Theoretically, we propose transparent identifiability conditions for both the unknown grouping structure and the associated model parameters in general settings. Methodologically, we propose a Bayesian approach for Dirichlet Gro-M$^3$s to infer the variable grouping structure and estimate model parameters. Simulation results demonstrate good computational performance and empirically confirm the identifiability results. We illustrate the new methodology through an application to a functional disability dataset.

[3]  arXiv:2109.11717 [pdf, other]
Title: Analysis of Ordinal Populations from Judgment Post-Stratification
Comments: 25 pages, 12 Figures, 2 Tables
Subjects: Methodology (stat.ME)

In surveys requiring cost efficiency, such as medical research, measuring the variable of interest (e.g., disease status) is expensive and/or time-consuming; however, we often have access to easily attainable characteristics about sampling units. These characteristics are not typically employed in the data collection process. Judgment post-stratification (JPS) sampling enables us to supplement the random samples from the population of interest with these characteristics as ranking information. In this paper, we develop methods based on JPS samples for the estimation of categorical ordinal populations. We develop various estimators from JPS data, even for situations in which JPS suffers from empty strata. We also propose JPS estimators using multiple ranking resources. Through extensive numerical studies, we evaluate the performance of the methods in the estimation of the population. Finally, the developed estimation methods are applied to bone mineral data to estimate the bone disorder status of patients aged 50 and older.

[4]  arXiv:2109.11727 [pdf, other]
Title: Smoothing splines approximation using Hilbert curve basis selection
Subjects: Methodology (stat.ME)

Smoothing splines have been used pervasively in nonparametric regressions. However, the computational burden of smoothing splines is significant when the sample size $n$ is large. When the number of predictors $d\geq2$, the computational cost for smoothing splines is of the order $O(n^3)$ using the standard approach. Many methods have been developed to approximate smoothing spline estimators by using $q$ basis functions instead of $n$ ones, resulting in a computational cost of the order $O(nq^2)$. These methods are called basis selection methods. Despite their algorithmic benefits, most basis selection methods require the assumption that the sample is uniformly distributed on a hyper-cube. These methods may have deteriorating performance when such an assumption is not met. To overcome this obstacle, we develop an efficient algorithm that is adaptive to the unknown probability density function of the predictors. Theoretically, we show the proposed estimator has the same convergence rate as the full-basis estimator when $q$ is roughly of the order $O[n^{2d/\{(pr+1)(d+2)\}}]$, where $p\in[1,2]$ and $r\approx 4$ are constants that depend on the type of the spline. Numerical studies on various synthetic datasets demonstrate the superior performance of the proposed estimator in comparison with mainstream competitors.

[5]  arXiv:2109.11761 [pdf, other]
Title: Sequentially valid tests for forecast calibration
Subjects: Methodology (stat.ME)

Forecasting and forecast evaluation are inherently sequential tasks. Predictions are often issued on a regular basis, such as every hour, day, or month, and their quality is monitored continuously. However, the classical statistical tools for forecast evaluation are static, in the sense that statistical tests for forecast calibration are only valid if the evaluation period is fixed in advance. Recently, e-values have been introduced as a new, dynamic method for assessing statistical significance. An e-value is a non-negative random variable with expected value at most one under a null hypothesis. Large e-values give evidence against the null hypothesis, and the multiplicative inverse of an e-value is a conservative p-value. E-values are particularly suitable for sequential forecast evaluation, since they naturally lead to statistical tests which are valid under optional stopping. This article proposes e-values for testing probabilistic calibration of forecasts, which is one of the most important notions of calibration. The proposed methods are also more generally applicable for sequential goodness-of-fit testing. We demonstrate that the e-values are competitive in terms of power when compared to extant methods, which do not allow sequential testing. Furthermore, they provide important and useful insights in the evaluation of probabilistic weather forecasts.
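The properties of e-values stated in the abstract above can be illustrated with a minimal sketch. It is not the paper's construction: it assumes the forecasts are summarised by probability integral transform (PIT) values that are Uniform(0,1) under calibration, and it uses a hypothetical fixed alternative density f(u) = 2u, so that each factor 2u has expectation one under the null. The function name `sequential_e_test` and the choice of alternative are illustrative assumptions.

```python
def sequential_e_test(pit_values, alpha=0.05):
    """Monitor a running product of e-values; stop once it reaches 1/alpha.

    Assumption (not from the paper): each e-value is 2*u, the Beta(2, 1)
    density evaluated at the PIT value u. Under Uniform(0, 1) its mean is 1,
    so the running product is a non-negative test martingale and rejecting
    at the threshold 1/alpha is valid under optional stopping.
    """
    e_process = 1.0
    for t, u in enumerate(pit_values, start=1):
        e_process *= 2.0 * u
        if e_process >= 1.0 / alpha:
            return t, e_process  # rejection time and evidence at that time
    return None, e_process  # no rejection within the observed sequence


def conservative_p_value(e_value):
    """The inverse of an e-value, capped at 1, is a conservative p-value."""
    return min(1.0, 1.0 / e_value) if e_value > 0 else 1.0


# Hypothetical PIT values concentrated near 1 (forecasts biased low).
stop_time, evidence = sequential_e_test([0.9] * 10)
print(stop_time, conservative_p_value(evidence))
```

A fixed alternative such as 2u only has power against PIT values that are too large; a practical test would need a richer or adaptively chosen alternative.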

[6]  arXiv:2109.11795 [pdf, other]
Title: Scalable Bayesian high-dimensional local dependence learning
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)

In this work, we propose a scalable Bayesian procedure for learning the local dependence structure in a high-dimensional model where the variables possess a natural ordering. The ordering of variables can be indexed by time, the vicinities of spatial locations, and so on, with the natural assumption that variables far apart tend to have weak correlations. Applications of such models abound in a variety of fields such as finance, genome associations analysis and spatial modeling. We adopt a flexible framework under which each variable is dependent on its neighbors or predecessors, and the neighborhood size can vary for each variable. It is of great interest to reveal this local dependence structure by estimating the covariance or precision matrix while yielding a consistent estimate of the varying neighborhood size for each variable. The existing literature on banded covariance matrix estimation, which assumes a fixed bandwidth, cannot be adapted for this general setup. We employ the modified Cholesky decomposition for the precision matrix and design a flexible prior for this model through appropriate priors on the neighborhood sizes and Cholesky factors. The posterior contraction rates of the Cholesky factor are derived, which are nearly or exactly minimax optimal, and our procedure leads to consistent estimates of the neighborhood size for all the variables. Another appealing feature of our procedure is its scalability to models with large numbers of variables due to efficient posterior inference without resorting to MCMC algorithms. Numerical comparisons are carried out with competitive methods, and applications are considered for some real datasets.
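As a rough illustration of the modified Cholesky decomposition mentioned in the abstract above, the sketch below builds a precision matrix from autoregressive coefficients and residual variances via the standard identity Omega = (I - A)^T D^{-1} (I - A), where A is strictly lower triangular. The coefficient values, the neighborhood sizes, and the function name `precision_from_cholesky` are hypothetical and not taken from the paper; no prior or posterior inference is shown.

```python
def precision_from_cholesky(A, d):
    """Return Omega = (I - A)^T D^{-1} (I - A) as a nested list.

    A[j][k] (k < j) is the coefficient of variable k in the regression of
    variable j on its predecessors; d[j] is the residual variance of j.
    """
    p = len(d)
    # T = I - A
    T = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(p)] for i in range(p)]
    return [
        [sum(T[k][i] * T[k][j] / d[k] for k in range(p)) for j in range(p)]
        for i in range(p)
    ]


# Hypothetical example with varying neighborhood sizes: variables 1 and 2
# depend on one predecessor each, variable 3 depends on two predecessors.
A = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.3, 0.5, 0.0],
]
d = [1.0, 1.0, 1.0, 1.0]
omega = precision_from_cholesky(A, d)
for row in omega:
    print(row)
```

In this example the entries of Omega linking variable 0 to variables 2 and 3 are zero, which is how the row-wise neighborhood sizes in A translate into local dependence in the precision matrix.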

[7]  arXiv:2109.11870 [pdf, other]
Title: Quantification of empirical determinacy: the impact of likelihood weighting on posterior location and spread in Bayesian meta-analysis estimated with JAGS and INLA
Comments: 22 pages, 1 figure
Subjects: Methodology (stat.ME); Computation (stat.CO)

The popular Bayesian meta-analysis expressed by the Bayesian normal-normal hierarchical model (NNHM) synthesizes knowledge from several studies and is highly relevant in practice. Moreover, NNHM is the simplest Bayesian hierarchical model (BHM), which illustrates problems typical in more complex BHMs. Until now, it has been unclear to what extent the data determines the marginal posterior distributions of the parameters in NNHM. To address this issue we computed the second derivative of the Bhattacharyya coefficient with respect to the weighted likelihood, defined the total empirical determinacy (TED), the proportion of the empirical determinacy of location to TED (pEDL), and the proportion of the empirical determinacy of spread to TED (pEDS). We implemented this method in the R package ed4bhm and considered two case studies and one simulation study. We quantified TED, pEDL and pEDS under different modeling conditions such as model parametrization, the primary outcome, and the prior. This clarified to what extent the location and spread of the marginal posterior distributions of the parameters are determined by the data. Although these investigations focused on Bayesian NNHM, the method proposed is applicable more generally to complex BHMs.

[8]  arXiv:2109.11904 [pdf, ps, other]
Title: Proximal mediation analysis
Comments: 60 pages, 3 figures
Subjects: Methodology (stat.ME)

A common concern when trying to draw causal inferences from observational data is that the measured covariates are insufficiently rich to account for all sources of confounding. In practice, many of the covariates may only be proxies of the latent confounding mechanism. Recent work has shown that in certain settings where the standard 'no unmeasured confounding' assumption fails, proxy variables can be leveraged to identify causal effects. Results currently exist for the total causal effect of an intervention, but little consideration has been given to learning about the direct or indirect pathways of the effect through a mediator variable. In this work, we describe three separate proximal identification results for natural direct and indirect effects in the presence of unmeasured confounding. We then develop a semiparametric framework for inference on natural (in)direct effects, which leads us to locally efficient, multiply robust estimators.

[9]  arXiv:2109.11989 [pdf, other]
Title: Correcting Conditional Mean Imputation for Censored Covariates and Improving Usability
Comments: 8 pages, 2 figures
Subjects: Methodology (stat.ME)

Analysts are often confronted with censoring, wherein some variables are not observed at their true value, but rather at a value that is known to fall above or below that truth. While much attention has been given to the analysis of censored outcomes, contemporary focus has shifted to censored covariates as well. Missing data is often overcome using multiple imputation, which leverages the entire dataset by replacing missing values with informed placeholders, and this method can be modified for censored data by also incorporating partial information from censored values. One such modification involves replacing censored covariates with their conditional means given other fully observed information, such as the censored value or additional covariates. So-called conditional mean imputation approaches were proposed for censored covariates in Atem et al. [2017], Atem et al. [2019a], and Atem et al. [2019b]. These methods are robust to additional parametric assumptions on the censored covariate and utilize all available data, which is appealing. As we worked to implement these methods, however, we discovered that these three manuscripts provide nonequivalent formulas and, in fact, none is the correct formula for the conditional mean. Herein, we derive the correct form of the conditional mean and demonstrate the impact of the incorrect formulas on the imputed values and statistical inference. Under several settings considered, using an incorrect formula is seen to seriously bias parameter estimation in simple linear regression. Lastly, we provide user-friendly R software, the imputeCensoRd package, to enable future researchers to tackle censored covariates in their data.

[10]  arXiv:2109.11990 [pdf, other]
Title: Optimization-based Causal Estimation from Heterogenous Environments
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association to the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently-proposed idea of environments, datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments -- and ones that exhibit sufficient heterogeneity -- CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model.

[11]  arXiv:2109.12069 [pdf, other]
Title: Towards a Paradigmatic Shift in Pre-election Polling Adequately Including Still Undecided Voters -- Some Ideas Based on Set-Valued Data for the 2021 German Federal Election
Comments: 13 pages, 11 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)

Within this paper we develop and apply new methodology adequately including undecided voters for the 2021 German federal election. Due to a cooperation with the polling institute Civey, we are in the fortunate position to obtain data in which undecided voters can state all the options they are still pondering between. In contrast to conventional polls, which force the undecided to either state a single party or to drop out, this design allows the undecided to provide their current position in an accurate and precise way. The resulting set-valued information can be used to examine structural properties of groups undecided between specific parties as well as to improve election forecasting. For forecasting, this partial information provides valuable additional knowledge, and the uncertainty induced by the participants' ambiguity can be conveyed within interval-valued results. Turning to coalitions of parties, which are at the core of the current public discussion in Germany, some of this uncertainty can be dissolved as the undecided provide precise information on corresponding coalitions. We show structural differences between the decided and undecided with discrete choice models and elaborate on the discrepancy between the conventional approach and our new ones including the undecided. Our cautious analysis further demonstrates that in most cases the undecideds' eventual decisions are pivotal for which coalitions could hold a majority of seats. Overall, accounting for the population's ambiguity leads to more credible results and paints a more holistic picture of the political landscape, paving the way for a possible paradigmatic shift concerning the adequate inclusion of undecided voters in pre-election polls.

Cross-lists for Mon, 27 Sep 21

[12]  arXiv:2109.11612 (cross-list from cs.LG) [pdf, other]
Title: Regret Lower Bound and Optimal Algorithm for High-Dimensional Contextual Linear Bandit
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)

In this paper, we consider the multi-armed bandit problem with high-dimensional features. First, we prove a minimax lower bound, $\mathcal{O}\big((\log d)^{\frac{\alpha+1}{2}}T^{\frac{1-\alpha}{2}}+\log T\big)$, for the cumulative regret, in terms of horizon $T$, dimension $d$ and a margin parameter $\alpha\in[0,1]$, which controls the separation between the optimal and the sub-optimal arms. This new lower bound unifies existing regret bound results that have different dependencies on T due to the use of different values of margin parameter $\alpha$ explicitly implied by their assumptions. Second, we propose a simple and computationally efficient algorithm inspired by the general Upper Confidence Bound (UCB) strategy that achieves a regret upper bound matching the lower bound. The proposed algorithm uses a properly centered $\ell_1$-ball as the confidence set in contrast to the commonly used ellipsoid confidence set. In addition, the algorithm does not require any forced sampling step and is thereby adaptive to the practically unknown margin parameter. Simulations and a real data analysis are conducted to compare the proposed method with existing ones in the literature.

[13]  arXiv:2109.11647 (cross-list from econ.EM) [pdf, other]
Title: Treatment Effects in Market Equilibrium
Comments: 61 pages, 1 figure
Subjects: Econometrics (econ.EM); Methodology (stat.ME)

In evaluating social programs, it is important to measure treatment effects within a market economy, where interference arises due to individuals buying and selling various goods at the prevailing market price. We introduce a stochastic model of potential outcomes in market equilibrium, where the market price is an exposure mapping. We prove that average direct and indirect treatment effects converge to interpretable mean-field treatment effects, and provide estimators for these effects through a unit-level randomized experiment augmented with randomization in prices. We also provide a central limit theorem for the estimators that depends on the sensitivity of outcomes to prices. For a variant where treatments are continuous, we show that the sum of direct and indirect effects converges to the total effect of a marginal policy change. We illustrate the coverage and consistency properties of the estimators in simulations of different interventions in a two-sided market.

[14]  arXiv:2109.11679 (cross-list from stat.ML) [pdf, other]
Title: Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Algorithmic recommendations and decisions have become ubiquitous in today's society. Many of these and other data-driven policies are based on known, deterministic rules to ensure their transparency and interpretability. This is especially true when such policies are used for public policy decision-making. For example, algorithmic pre-trial risk assessments, which serve as our motivating application, provide relatively simple, deterministic classification scores and recommendations to help judges make release decisions. Unfortunately, existing methods for policy learning are not applicable because they require existing policies to be stochastic rather than deterministic. We develop a robust optimization approach that partially identifies the expected utility of a policy, and then finds an optimal policy by minimizing the worst-case regret. The resulting policy is conservative but has a statistical safety guarantee, allowing the policy-maker to limit the probability of producing a worse outcome than the existing policy. We extend this approach to common and important settings where humans make decisions with the aid of algorithmic recommendations. Lastly, we apply the proposed methodology to a unique field experiment on pre-trial risk assessments. We derive new classification and recommendation rules that retain the transparency and interpretability of the existing risk assessment instrument while potentially leading to better overall outcomes at a lower cost.

[15]  arXiv:2109.11827 (cross-list from math.PR) [pdf, ps, other]
Title: Approximations of Piecewise Deterministic Markov Processes and their convergence properties
Subjects: Probability (math.PR); Methodology (stat.ME)

Piecewise deterministic Markov processes (PDMPs) are a class of stochastic processes with applications in several fields of applied mathematics spanning from mathematical modeling of physical phenomena to computational methods. A PDMP is specified by three characteristic quantities: the deterministic motion, the law of the random event times, and the jump kernels. The applicability of PDMPs to real world scenarios is currently limited by the fact that these processes can be simulated only when these three characteristics of the process can be simulated exactly. In order to overcome this problem, we introduce discretisation schemes for PDMPs which make their approximate simulation possible. In particular, we design both first order and higher order schemes that rely on approximations of one or more of the three characteristics. For the proposed approximation schemes we study both pathwise convergence to the continuous PDMP as the step size converges to zero and convergence in law to the invariant measure of the PDMP in the long time limit. Moreover, we apply our theoretical results to several PDMPs that arise from the computational statistics and mathematical biology literature.

[16]  arXiv:2109.12006 (cross-list from stat.AP) [pdf, ps, other]
Title: A comprehensive review of variable selection in high-dimensional regression for molecular biology
Comments: 15 pages, 5 tables
Subjects: Applications (stat.AP); Methodology (stat.ME)

Variable selection methods are widely used in molecular biology to detect biomarkers or to infer gene regulatory networks from transcriptomic data. Methods are mainly based on the high-dimensional Gaussian linear regression model, and we focus on this framework for this review. We propose a comparison study of variable selection procedures from regularization paths by considering three simulation settings. In the first one, the variables are independent, allowing the evaluation of the methods in the theoretical framework used to develop them. In the second setting, two structures of the correlation between variables are considered to evaluate how the biological dependencies usually observed affect the estimation. Finally, the third setting mimics the biological complexity of transcription factor regulations; it is the setting farthest from the Gaussian framework. In all the settings, the capacity of prediction and the identification of the explaining variables are evaluated for each method. Our results show that variable selection procedures rely on statistical assumptions that should be carefully checked. The Gaussian assumption and the number of explaining variables are the two key points. As soon as correlation exists, the regularization function Elastic-net provides better results than Lasso. LinSelect, a non-asymptotic model selection method, should be preferred to the commonly used eBIC criterion. Bolasso is a judicious strategy to limit the selection of non-explaining variables.

[17]  arXiv:2109.12042 (cross-list from stat.ML) [pdf, other]
Title: Combining Discrete Choice Models and Neural Networks through Embeddings: Formulation, Interpretability and Performance
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)

This study proposes a novel approach that combines theory and data-driven choice models using Artificial Neural Networks (ANNs). In particular, we use continuous vector representations, called embeddings, for encoding categorical or discrete explanatory variables with a special focus on interpretability and model transparency. Although embedding representations within the logit framework have been conceptualized by Camara (2019), their dimensions do not have an absolute definitive meaning, hence offering limited behavioral insights. The novelty of our work lies in enforcing interpretability to the embedding vectors by formally associating each of their dimensions to a choice alternative. Thus, our approach brings benefits much beyond a simple parsimonious representation improvement over dummy encoding, as it provides behaviorally meaningful outputs that can be used in travel demand analysis and policy decisions. Additionally, in contrast to previously suggested ANN-based Discrete Choice Models (DCMs) that either sacrifice interpretability for performance or are only partially interpretable, our models preserve interpretability of the utility coefficients for all the input variables despite being based on ANN principles. The proposed models were tested on two real world datasets and evaluated against benchmark and baseline models that use dummy-encoding. The results of the experiments indicate that our models deliver state-of-the-art predictive performance, outperforming existing ANN-based models while drastically reducing the number of required network parameters.

Replacements for Mon, 27 Sep 21

[18]  arXiv:1904.06340 (replaced) [pdf, other]
Title: A Composite Likelihood-based Approach for Change-point Detection in Spatio-temporal Process
Subjects: Methodology (stat.ME)
[19]  arXiv:1907.08414 (replaced) [pdf, other]
Title: Reluctant Interaction Modeling
Subjects: Methodology (stat.ME); Computation (stat.CO)
[20]  arXiv:2006.01924 (replaced) [pdf, other]
Title: Eigenvectors from Eigenvalues Sparse Principal Component Analysis (EESPCA)
Authors: H. Robert Frost
Subjects: Methodology (stat.ME); Quantitative Methods (q-bio.QM)
[21]  arXiv:2012.11501 (replaced) [pdf, other]
Title: Sparse tensor product approximation for a class of generalized method of moments estimators
Comments: 33 pages, 4 tables, 4 figures
Subjects: Methodology (stat.ME); Numerical Analysis (math.NA)
[22]  arXiv:2102.13550 (replaced) [pdf, other]
Title: An introduction to the determination of the probability of a successful trial: Frequentist and Bayesian approaches
Comments: 27 pages
Subjects: Methodology (stat.ME); Applications (stat.AP)
[23]  arXiv:2103.01097 (replaced) [pdf, other]
Title: Tangent functional canonical correlation analysis for densities and shapes, with applications to multimodal imaging data
Subjects: Methodology (stat.ME); Applications (stat.AP)
[24]  arXiv:2108.03544 (replaced) [pdf]
Title: Resurrecting the One-Sided P-value as a Likelihood Ratio
Authors: Nicholas Adams
Comments: 18 pages, 2 figures
Subjects: Methodology (stat.ME)
[25]  arXiv:2108.05990 (replaced) [pdf, other]
Title: Statistical Learning using Sparse Deep Neural Networks in Empirical Risk Minimization
Subjects: Methodology (stat.ME)
[26]  arXiv:2006.16901 (replaced) [pdf, other]
Title: Hierarchical sparse Cholesky decomposition with applications to high-dimensional spatio-temporal filtering
Subjects: Computation (stat.CO); Methodology (stat.ME)
[27]  arXiv:2009.05318 (replaced) [pdf, ps, other]
Title: Augmented pseudo-marginal Metropolis-Hastings for partially observed diffusion processes
Comments: 26 pages
Subjects: Computation (stat.CO); Methodology (stat.ME)
[28]  arXiv:2108.02115 (replaced) [pdf, other]
Title: An autoregressive model for a censored data denoising method robust to outliers with application to the Obépine SARS-Cov-2 monitoring
Comments: 16 pages, 10 figures
Subjects: Applications (stat.AP); Methodology (stat.ME)