New submissions for Wed, 8 Feb 23

[1]  arXiv:2302.03157 [pdf, other]
Title: A distribution-free mixed-integer optimization approach to hierarchical modelling of clustered and longitudinal data
Subjects: Methodology (stat.ME); Optimization and Control (math.OC); Machine Learning (stat.ML)

We develop a mixed-integer optimization (MIO) approach to cluster-aware regression, i.e., linear regression that takes into account the inherent clustered structure of the data. We compare it with linear mixed effects regression (LMEM), currently the most widely used method, and design simulation experiments that show superior performance to LMEM on both predictive and inferential metrics in silico. Furthermore, we show that our method is formulated in a highly interpretable way: LMEM cannot generalize and make cluster-informed predictions when the cluster of new data points is unknown, but we solve this problem by training an interpretable classification tree that helps decide cluster effects for new data points, and we demonstrate the power of this generalizability on a real protein expression dataset.

[2]  arXiv:2302.03178 [pdf, other]
Title: Nonlinear Causal Discovery with Confounders
Comments: 28 pages, 4 figures, 3 tables; a version is accepted by Journal of the American Statistical Association
Subjects: Methodology (stat.ME)

This article introduces a causal discovery method to learn nonlinear relationships in a directed acyclic graph with correlated Gaussian errors due to confounding. First, we derive model identifiability under the sublinear growth assumption. Then, we propose a novel method, named the Deconfounded Functional Structure Estimation (DeFuSE), consisting of a deconfounding adjustment to remove the confounding effects and a sequential procedure to estimate the causal order of variables. We implement DeFuSE via feedforward neural networks for scalable computation. Moreover, we establish the consistency of DeFuSE under an assumption called the strong causal minimality. In simulations, DeFuSE compares favorably against state-of-the-art competitors that ignore confounding or nonlinearity. Finally, we demonstrate the utility and effectiveness of the proposed approach with an application to gene regulatory network analysis. The Python implementation is available at https://github.com/chunlinli/defuse.

[3]  arXiv:2302.03200 [pdf, other]
Title: Multivariate Bayesian dynamic modeling for causal prediction
Subjects: Methodology (stat.ME)

Bayesian dynamic modeling and forecasting is developed in the setting of sequential time series analysis for causal inference. Causal evaluation of sequentially observed time series data from control and treated units focuses on the impacts of interventions using synthetic control constructs. Methodological contributions include the development of multivariate dynamic models for time-varying effects across multiple treated units and explicit foci on sequential learning of effects of interventions. Analysis explores the utility of dimension reduction of multiple potential synthetic control variables. These methodological advances are evaluated in a detailed case study in commercial forecasting. This involves in-study evaluation of interventions in a supermarket promotions experiment, with coupled predictive analyses in selected regions of a large-scale commercial system. Generalization of causal predictive inferences from experimental settings to broader populations is a central concern, and one that can be impacted by cross-series dependencies.

[4]  arXiv:2302.03237 [pdf, ps, other]
Title: Examination of Nonlinear Longitudinal Processes with Latent Variables, Latent Processes and Latent Classes: The R package NonLinearCurve
Authors: Jin Liu
Comments: Draft version 1.0, February 6, 2023. This paper has not been peer reviewed. Please do not copy or cite without author's permission
Subjects: Methodology (stat.ME)

We introduce the R package NonLinearCurve, which provides a series of functions to evaluate longitudinal processes with individual measurement occasions in the structural equation modeling (SEM) framework. It aims to provide computational tools for nonlinear longitudinal models, especially intrinsically nonlinear longitudinal models, in the scenarios of (1) a univariate longitudinal process captured by a series of latent variables, with or without covariates, including time-invariant covariates (TICs) and time-varying covariates (TVCs), (2) multivariate longitudinal processes to assess correlation or causation between longitudinal variables, and (3) mixture versions of the models in scenario 1 or 2, under the assumption that trajectories come from heterogeneous latent classes. By interfacing with the R package OpenMx, NonLinearCurve allows for the flexible specification of structural equation models and generates maximum likelihood estimators based on the full information maximum likelihood technique. The package provides an algorithm to derive a set of initial values from the raw data, aiming to facilitate computation and improve the likelihood of model convergence. The package also provides functions for goodness-of-fit analyses, clustering analyses, plots, and predicted trajectories. This paper is a companion to the package, introducing each scenario of models, the estimation technique, some implementation details, and output interpretation, with examples based on a dataset on intelligence development.

[5]  arXiv:2302.03435 [pdf, ps, other]
Title: Logistic regression with missing responses and predictors: a review of existing approaches and a case study
Comments: 13 pages, 14 tables
Subjects: Methodology (stat.ME); Applications (stat.AP)

This work considers logistic regression when both the response and the predictor variables may be missing. Several existing approaches are reviewed, including complete case analysis, inverse probability weighting, multiple imputation and maximum likelihood. The methods are compared in a simulation study, which evaluates the bias, the variance and the mean squared error of the estimators of the regression coefficients. In the simulations, the maximum likelihood methodology gives the best results, followed by multiple imputation with five imputations. The methods are applied to a case study on obesity in schoolchildren in the municipality of Viana do Castelo, North Portugal, where a logistic regression model is used to predict the International Obesity Task Force (IOTF) indicator from physical examinations and the past values of the obesity status. All the variables in the case study are potentially missing, with gender as the only exception. The results provided by the several methods are in good agreement, indicating the relevance of the past values of IOTF and physical scores for the prediction of obesity. Practical recommendations are given.

[6]  arXiv:2302.03440 [pdf, other]
Title: Comparison of Quantile Regression Curves with Censored Data
Subjects: Methodology (stat.ME); Applications (stat.AP)

This paper proposes a new test for the comparison of conditional quantile curves when the outcome of interest, typically a duration, is subject to right censoring. The test can be applied both in the case of two independent samples and for paired data, and can be used for the comparison of quantiles at a fixed quantile level, a finite set of levels or a range of quantile levels. The asymptotic distribution of the proposed test statistics is obtained both under the null hypothesis and under local alternatives. We describe a bootstrap procedure to approximate the critical values, and present the results of a simulation study in which the performance of the tests for small and moderate sample sizes is studied and compared with the behavior of alternative tests. Finally, we apply the proposed tests to a data set concerning diabetic retinopathy.

[7]  arXiv:2302.03544 [pdf, other]
Title: Causally-Interpretable Random-Effects Meta-Analysis
Subjects: Methodology (stat.ME)

Recent work has made important contributions in the development of causally-interpretable meta-analysis. These methods transport treatment effects estimated in a collection of randomized trials to a target population of interest. Ideally, estimates targeted toward a specific population are more interpretable and relevant to policy-makers and clinicians. However, between-study heterogeneity not arising from differences in the distribution of treatment effect modifiers can raise difficulties in synthesizing estimates across trials. The existence of such heterogeneity, including variations in treatment modality, also complicates the interpretation of transported estimates as a generic effect in the target population. We propose a conceptual framework and estimation procedures that attempt to account for such heterogeneity, and develop inferential techniques that aim to capture the accompanying excess variability in causal estimates. This framework also seeks to clarify the kind of treatment effects that are amenable to the techniques of generalizability and transportability.

[8]  arXiv:2302.03558 [pdf]
Title: Enhanced Inference for Finite Population Sampling-Based Prevalence Estimation with Misclassification Errors
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)

Epidemiologic screening programs often make use of tests with small but non-zero probabilities of misdiagnosis. In this article, we assume the target population is finite with a fixed number of true cases, and that we apply an imperfect test with known sensitivity and specificity to a sample of individuals from the population. In this setting, we propose an enhanced inferential approach for use in conjunction with sampling-based bias-corrected prevalence estimation. While ignoring the finite nature of the population can yield markedly conservative estimates, direct application of a standard finite population correction (FPC) conversely leads to underestimation of variance. We uncover a way to leverage the typical FPC indirectly toward valid statistical inference. In particular, we derive a readily estimable extra variance component induced by misclassification in this specific but arguably common diagnostic testing scenario. Our approach yields a standard error estimate that properly captures the sampling variability of the usual bias-corrected maximum likelihood estimator of disease prevalence. Finally, we develop an adapted Bayesian credible interval for the true prevalence that offers improved frequentist properties (i.e., coverage and width) relative to a Wald-type confidence interval. We report simulation results that demonstrate the enhanced performance of the proposed inferential methods.
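
As a rough illustration of the "bias-corrected" prevalence estimate the abstract mentions, the sketch below implements the standard correction for a test with known sensitivity and specificity (often called the Rogan-Gladen estimator). It is an assumption that this is the estimator the abstract refers to. The function name and the example counts are hypothetical, and the paper's finite-population variance component and credible interval are not reproduced here.

```python
def bias_corrected_prevalence(positives, n, sensitivity, specificity):
    """Correct the apparent (test-positive) proportion for misclassification.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if n <= 0 or not 0 <= positives <= n:
        raise ValueError("need n > 0 and 0 <= positives <= n")
    youden = sensitivity + specificity - 1.0
    if youden <= 0:
        raise ValueError("test must be informative: sensitivity + specificity > 1")
    apparent = positives / n
    # Solve apparent = Se * pi + (1 - Sp) * (1 - pi) for the true prevalence pi.
    corrected = (apparent + specificity - 1.0) / youden
    # Truncate to [0, 1], since the raw correction can fall outside the unit interval.
    return min(1.0, max(0.0, corrected))


if __name__ == "__main__":
    # Made-up numbers: 120 positives among 1000 sampled, Se = 0.90, Sp = 0.95.
    print(bias_corrected_prevalence(120, 1000, 0.90, 0.95))
```

The truncation step matters when the apparent proportion is below the false-positive rate, in which case the untruncated correction would be negative.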

Cross-lists for Wed, 8 Feb 23

[9]  arXiv:2302.03172 (cross-list from econ.EM) [pdf, ps, other]
Title: High-Dimensional Conditionally Gaussian State Space Models with Missing Data
Subjects: Econometrics (econ.EM); Computation (stat.CO); Methodology (stat.ME)

We develop an efficient sampling approach for handling complex missing data patterns and a large number of missing observations in conditionally Gaussian state space models. Two important examples are dynamic factor models with unbalanced datasets and large Bayesian VARs with variables in multiple frequencies. A key insight underlying the proposed approach is that the joint distribution of the missing data conditional on the observed data is Gaussian. Moreover, the inverse covariance or precision matrix of this conditional distribution is sparse, and this special structure can be exploited to substantially speed up computations. We illustrate the methodology using two empirical applications. The first application combines quarterly, monthly and weekly data using a large Bayesian VAR to produce weekly GDP estimates. In the second application, we extract latent factors from unbalanced datasets involving over a hundred monthly variables via a dynamic factor model with stochastic volatility.
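
The "key insight" above rests on a generic property of jointly Gaussian vectors, which can be written in terms of the precision matrix. The notation below ($x_o$, $x_m$, $\mu$, $Q$) is introduced here for illustration and is not taken from the paper; the paper's state space structure is what would make the relevant block of $Q$ sparse.

```latex
% Generic Gaussian conditioning in precision form (illustrative notation).
% Partition x into observed (o) and missing (m) components:
\[
x = \begin{pmatrix} x_o \\ x_m \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_o \\ \mu_m \end{pmatrix},\; Q^{-1}
\right),
\qquad
Q = \begin{pmatrix} Q_{oo} & Q_{om} \\ Q_{mo} & Q_{mm} \end{pmatrix}.
\]
% The missing block given the observed block is again Gaussian,
% and its precision matrix is simply the (m, m) block of Q:
\[
x_m \mid x_o \;\sim\; \mathcal{N}\!\left(
\mu_m - Q_{mm}^{-1} Q_{mo}\,(x_o - \mu_o),\; Q_{mm}^{-1}
\right).
\]
```

Because the conditional precision is a sub-block of $Q$, any sparsity (for example, bandedness) in $Q$ carries over directly, which is what allows sparse linear algebra to be used when sampling $x_m$.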

[10]  arXiv:2302.03246 (cross-list from cs.LG) [pdf, other]
Title: CDANs: Temporal Causal Discovery from Autocorrelated and Non-Stationary Time Series Data
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)

This study presents a novel constraint-based causal discovery approach for autocorrelated and non-stationary time series data (CDANs). Our proposed method addresses several limitations of existing causal discovery methods for autocorrelated and non-stationary time series data, such as high dimensionality, the inability to identify lagged causal relationships, and the neglect of changing modules. Our approach identifies both lagged and instantaneous/contemporaneous causal relationships along with changing modules that vary over time. The method optimizes the conditioning sets in a constraint-based search by conditioning on lagged parents rather than on the entire past, which addresses high dimensionality. The changing modules are detected by considering both contemporaneous and lagged parents. The approach first detects the lagged adjacencies, then identifies the changing modules and contemporaneous adjacencies, and finally determines the causal direction. We extensively evaluated the proposed method using synthetic datasets and a real-world clinical dataset and compared its performance with several baseline approaches. The results demonstrate the effectiveness of the proposed method in detecting causal relationships and changing modules in autocorrelated and non-stationary time series data.

[11]  arXiv:2302.03314 (cross-list from stat.ML) [pdf, other]
Title: Federated Variational Inference Methods for Structured Latent Variable Models
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)

Federated learning methods, that is, methods that perform model training using data situated across different sources without the data leaving their original source, are of increasing interest in a number of fields. However, despite this interest, the class of models for which easily applicable and sufficiently general approaches are available is limited, excluding many structured probabilistic models. We present a general yet elegant resolution to this issue. The approach is based on adapting structured variational inference, an approach widely used in Bayesian machine learning, to the federated setting. Additionally, a communication-efficient variant analogous to the canonical FedAvg algorithm is explored. The effectiveness of the proposed algorithms is demonstrated, and their performance is compared on Bayesian multinomial regression, topic modelling, and mixed model examples.

[12]  arXiv:2302.03391 (cross-list from stat.ML) [pdf, other]
Title: Sparse GEMINI for Joint Discriminative Clustering and Feature Selection
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)

Feature selection in clustering is a hard task that involves simultaneously discovering relevant clusters and the variables relevant to these clusters. While feature selection algorithms are often model-based through optimised model selection or strong assumptions on $p(\pmb{x})$, we introduce a discriminative clustering model trying to maximise a geometry-aware generalisation of the mutual information called GEMINI with a simple $\ell_1$ penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and is easily scalable to high-dimensional data and large numbers of samples while only designing a clustering model $p_\theta(y|\pmb{x})$. We demonstrate the performance of Sparse GEMINI on synthetic datasets as well as large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses.

[13]  arXiv:2302.03687 (cross-list from econ.EM) [pdf, ps, other]
Title: Efficient Covariate Adjustment in Stratified Experiments
Authors: Max Cytrynbaum
Subjects: Econometrics (econ.EM); Methodology (stat.ME)

This paper studies covariate adjusted estimation of the average treatment effect (ATE) in stratified experiments. We work in the stratified randomization framework of Cytrynbaum (2021), which includes matched tuples designs (e.g. matched pairs), coarse stratification, and complete randomization as special cases. Interestingly, we show that the Lin (2013) interacted regression is generically asymptotically inefficient, with efficiency only in the edge case of complete randomization. Motivated by this finding, we derive the optimal linear covariate adjustment for a given stratified design, constructing several new estimators that achieve the minimal variance. Conceptually, we show that optimal linear adjustment of a stratified design is equivalent in large samples to doubly-robust semiparametric adjustment of an independent design. We also develop novel asymptotically exact inference for the ATE over a general family of adjusted estimators, showing in simulations that the usual Eicker-Huber-White confidence intervals can significantly overcover. Our inference methods produce shorter confidence intervals by fully accounting for the precision gains from both covariate adjustment and stratified randomization. Simulation experiments and an empirical application to the Oregon Health Insurance Experiment data (Finkelstein et al. (2012)) demonstrate the value of our proposed methods.
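
For readers unfamiliar with the baseline discussed above, the following is a minimal sketch of the Lin (2013) interacted regression with a single covariate. It relies on the fact that the treatment coefficient in a regression of the outcome on treatment, the mean-centered covariate, and their interaction equals the difference between arm-specific fitted lines evaluated at the overall covariate mean. The function names and toy data are hypothetical, and the paper's optimal adjustments for stratified designs are not implemented here.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one covariate)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def lin_interacted_ate(y, d, x):
    """Interacted-regression ATE estimate for binary treatment d.

    Fits one line per arm, then compares the two fits at the overall mean of x.
    """
    x_bar = mean(x)
    fitted_at_mean = {}
    for arm in (0, 1):
        xs = [xi for xi, di in zip(x, d) if di == arm]
        ys = [yi for yi, di in zip(y, d) if di == arm]
        intercept, slope = fit_line(xs, ys)
        fitted_at_mean[arm] = intercept + slope * x_bar
    return fitted_at_mean[1] - fitted_at_mean[0]


if __name__ == "__main__":
    # Toy data: noiseless outcomes with a treatment effect of 3 + 0.5 * x.
    x = [0, 1, 2, 3, 4, 5]
    d = [0, 1, 0, 1, 0, 1]
    y = [1 + 2 * xi + di * (3 + 0.5 * xi) for xi, di in zip(x, d)]
    print(lin_interacted_ate(y, d, x))
```

In the toy data the average effect over the sample is 3 + 0.5 times the mean of x, so the estimate can be checked by hand.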

Replacements for Wed, 8 Feb 23

[14]  arXiv:2012.11100 (replaced) [pdf, other]
Title: Two-directional simultaneous inference for high-dimensional models
Subjects: Methodology (stat.ME)
[15]  arXiv:2107.00153 (replaced) [pdf, other]
Title: Root and community inference on the latent growth process of a network
Authors: Harry Crane, Min Xu
Comments: 69 pages; 29 figures
Subjects: Methodology (stat.ME); Probability (math.PR); Computation (stat.CO)
[16]  arXiv:2107.10017 (replaced) [pdf, other]
Title: Permutation-based multiple testing corrections for p-values and confidence intervals for cluster randomised trials
Subjects: Methodology (stat.ME)
[17]  arXiv:2111.10718 (replaced) [pdf, other]
Title: The R2D2 Prior for Generalized Linear Mixed Models
Subjects: Methodology (stat.ME)
[18]  arXiv:2202.08728 (replaced) [pdf, other]
Title: A nonparametric extension of randomized response for private confidence sets
Comments: 49 pages, 6 figures
Subjects: Methodology (stat.ME); Cryptography and Security (cs.CR); Statistics Theory (math.ST); Machine Learning (stat.ML)
[19]  arXiv:2202.12263 (replaced) [pdf, other]
Title: Causal Effect Identification in Cluster DAGs
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[20]  arXiv:2203.00768 (replaced) [pdf, other]
Title: Privacy-Preserving, Communication-Efficient, and Target-Flexible Hospital Quality Measurement
Comments: 49 pages of main text + 28 pages of supplemental material
Subjects: Methodology (stat.ME); Applications (stat.AP)
[21]  arXiv:2205.14504 (replaced) [pdf, other]
Title: Bayesian prediction via nonparametric transformation models
Comments: The corresponding R package BuLTM is available on GitHub this https URL
Subjects: Methodology (stat.ME)
[22]  arXiv:2205.15461 (replaced) [pdf, other]
Title: Derandomized knockoffs: leveraging e-values for false discovery rate control
Comments: 26 pages, 7 figures and 2 tables
Subjects: Methodology (stat.ME)
[23]  arXiv:2210.03964 (replaced) [pdf, other]
Title: An Efficient and Continuous Voronoi Density Estimator
Comments: 13 pages
Subjects: Methodology (stat.ME); Computational Geometry (cs.CG)
[24]  arXiv:2211.03758 (replaced) [pdf, other]
Title: Privacy Aware Experiments without Cookies
Comments: Technical report supplementing paper accepted to WSDM 23
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
[25]  arXiv:1907.01049 (replaced) [pdf, other]
Title: Permutation inference with a finite number of heterogeneous clusters
Authors: Andreas Hagemann
Comments: 28 pages, 3 figures, 2 tables; final pre-publication version
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
[26]  arXiv:2110.10745 (replaced) [pdf, other]
Title: Iterated Block Particle Filter for High-dimensional Parameter Learning: Beating the Curse of Dimensionality
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME)
[27]  arXiv:2204.02299 (replaced) [pdf, other]
Title: Theoretical properties of Bayesian Student-$t$ linear regression
Journal-ref: Statistics & Probability Letters, 193, 1-8 (2023)
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
[28]  arXiv:2211.08573 (replaced) [pdf, other]
Title: Realization of Causal Representation Learning to Adjust Confounding Bias in Latent Space
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
[29]  arXiv:2212.09706 (replaced) [pdf, ps, other]
Title: Multiple testing under negative dependence
Comments: 24 pages, 3 figures
Subjects: Statistics Theory (math.ST); Probability (math.PR); Methodology (stat.ME)