Methodology
New submissions
New submissions for Tue, 19 Mar 24
- [1] arXiv:2403.10680 [pdf, other]
Title: Spatio-temporal Occupancy Models with INLA
Subjects: Methodology (stat.ME); Applications (stat.AP)
Modern methods for quantifying and predicting species distribution play a crucial part in biodiversity conservation. Occupancy models are a popular choice for analyzing species occurrence data as they make it possible to separate the observational error induced by imperfect detection from the sources of bias affecting the occupancy process. However, the spatial and temporal variation in occupancy not accounted for by environmental covariates is often ignored or modelled through simple spatial structures because the computational cost of fitting explicit spatio-temporal models is too high. In this work, we demonstrate how INLA may be used to fit complex occupancy models and how the R-INLA package can provide a user-friendly interface to make such complex models available to users.
We show how occupancy models, given some simplification of the detection process, can be framed as latent Gaussian models and benefit from the powerful INLA machinery. A large selection of complex modelling features and random effect models have already been implemented in R-INLA. These become available for occupancy models, providing the user with an efficient and flexible toolbox.
We illustrate how INLA provides a computationally efficient framework for developing and fitting complex occupancy models using two case studies. Through these, we show how different spatio-temporal models that include spatial-varying trends, smooth terms, and spatio-temporal random effects can be fitted. At the cost of limiting the complexity of the detection model, INLA can incorporate a range of complex structures in the process.
INLA-based occupancy models provide an alternative framework to fit complex spatiotemporal occupancy models. The need for new and more flexible computational approaches to fit such models makes INLA an attractive option for addressing complex ecological problems, and a promising area of research.
- [2] arXiv:2403.10742 [pdf, other]
Title: Assessing Delayed Treatment Benefits of Immunotherapy Using Long-Term Average Hazard: A Novel Test/Estimation Approach
Subjects: Methodology (stat.ME)
Delayed treatment effects on time-to-event outcomes have often been observed in randomized controlled studies of cancer immunotherapies. In the case of delayed onset of treatment effect, the conventional test/estimation approach using the log-rank test for between-group comparison and Cox's hazard ratio to estimate the magnitude of treatment effect is not optimal, because the log-rank test is not the most powerful option, and the interpretation of the resulting hazard ratio is not obvious. Recently, alternative test/estimation approaches were proposed to address both the power issue and the interpretation problems of the conventional approach. One is a test/estimation approach based on long-term restricted mean survival time, and the other approach is based on average hazard with survival weight. This paper integrates these two ideas and proposes a novel test/estimation approach based on long-term average hazard (LT-AH) with survival weight. Numerical studies reveal specific scenarios where the proposed LT-AH method provides a higher power than the two alternative approaches. The proposed approach has test/estimation coherency and can provide robust estimates of the magnitude of treatment effect not dependent on study-specific censoring time distribution. Also, the proposed LT-AH approach can summarize the magnitude of the treatment effect in both absolute difference and relative terms using "hazard" (i.e., difference in LT-AH and ratio of LT-AH), meeting guideline recommendations and practical needs. This proposed approach can be a useful alternative to the traditional hazard-based test/estimation approach when delayed onset of survival benefit is expected.
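As a rough illustration of the quantity discussed in [2], the sketch below assumes that the average hazard with survival weight over a window [tau1, tau2] is (S(tau1) - S(tau2)) divided by the integral of S(t) over that window, which reduces to the constant hazard when survival is exponential. The abstract does not state this formula, so the definition, the function names, and the two hypothetical survival curves are assumptions made here for illustration, not the authors' implementation.

```python
import math


def average_hazard(surv, t1, t2, n=10_000):
    """Assumed definition: (S(t1) - S(t2)) / integral_{t1}^{t2} S(t) dt.

    The integral is approximated with the composite trapezoidal rule.
    """
    h = (t2 - t1) / n
    area = 0.5 * (surv(t1) + surv(t2))
    for i in range(1, n):
        area += surv(t1 + i * h)
    area *= h
    return (surv(t1) - surv(t2)) / area


def exponential_survival(rate):
    """Survival curve with a constant hazard."""
    return lambda t: math.exp(-rate * t)


def delayed_effect_survival(rate_early, rate_late, delay):
    """Piecewise-exponential curve: hazard changes from rate_early to rate_late at `delay`."""
    def surv(t):
        if t <= delay:
            return math.exp(-rate_early * t)
        return math.exp(-rate_early * delay - rate_late * (t - delay))
    return surv


# Hypothetical arms: the treatment halves the hazard only after month 6.
control = exponential_survival(0.10)
treated = delayed_effect_survival(0.10, 0.05, delay=6.0)

# Long-term window [6, 36] months (hypothetical choice of tau1 and tau2).
ah_control = average_hazard(control, 6.0, 36.0)
ah_treated = average_hazard(treated, 6.0, 36.0)

print("difference in LT-AH:", ah_treated - ah_control)
print("ratio of LT-AH:     ", ah_treated / ah_control)
```

With these hypothetical curves the ratio is approximately 0.5 and the difference approximately -0.05, because the window starts exactly where the hazards begin to differ.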
- [3] arXiv:2403.10878 [pdf, ps, other]
Title: Cubature scheme for spatio-temporal Poisson point processes estimation
Comments: arXiv admin note: text overlap with arXiv:2302.13684, arXiv:2209.07153
Subjects: Methodology (stat.ME); Computation (stat.CO)
This work presents the cubature scheme for the fitting of spatio-temporal Poisson point processes. The methodology is implemented in the R (R Core Team, 2024) package stopp (D'Angelo and Adelfio, 2023), published on the Comprehensive R Archive Network (CRAN) and available from https://CRAN.R-project.org/package=stopp. Since the number of dummy points should be sufficient for an accurate estimate of the likelihood, numerical experiments are currently under development to give guidelines on this aspect.
- [4] arXiv:2403.10945 [pdf, other]
Title: Zero-Inflated Stochastic Volatility Model for Disaggregated Inflation Data with Exact Zeros
Subjects: Methodology (stat.ME); Applications (stat.AP)
The disaggregated time-series data for Consumer Price Index often exhibits frequent instances of exact zero price changes, stemming from measurement errors inherent in the data collection process. However, the currently prominent stochastic volatility model of trend inflation is designed for aggregate measures of price inflation, where exact zero price changes rarely occur. We propose a zero-inflated stochastic volatility model applicable to such nonstationary real-valued multivariate time-series data with exact zeros, by a Bayesian dynamic generalized linear model that jointly specifies the dynamic zero-generating process. We also provide an efficient custom Gibbs sampler that leverages the Pólya-Gamma augmentation. Applying the model to disaggregated Japanese Consumer Price Index data, we find that the zero-inflated model provides more sensible and informative estimates of time-varying trend and volatility. Through an out-of-sample forecasting exercise, we find that the zero-inflated model provides improved point forecasts when zero-inflation is prominent, and better coverage of interval forecasts of the non-zero data by the non-zero distributional component.
- [5] arXiv:2403.11003 [pdf, other]
Title: Extreme Treatment Effect: Extrapolating Causal Effects Into Extreme Treatment Domain
Authors: Juraj Bodik
Subjects: Methodology (stat.ME)
The potential outcomes framework serves as a fundamental tool for quantifying causal effects. When the treatment variable (exposure) is continuous, one is typically interested in the estimation of the effect curve (also called the average dose-response function), denoted as \(\mu(t)\). In this work, we explore the "extreme causal effect," where our focus lies in determining the impact of an extreme level of treatment, potentially beyond the range of observed values, that is, estimating \(\mu(t)\) for very large \(t\). Our framework is grounded in the field of statistics known as extreme value theory. We establish the foundation for our approach, outlining key assumptions that enable the estimation of the extremal causal effect. Additionally, we present a novel and consistent estimation procedure that utilizes extreme value theory in order to potentially reduce the dimension of the confounders to at most 3. In practical applications, our framework proves valuable when assessing the effects of scenarios such as drug overdoses, extreme river discharges, or extremely high temperatures on a variable of interest.
- [6] arXiv:2403.11017 [pdf, other]
Title: Continuous-time mediation analysis for repeatedly measured mediators and outcomes
Subjects: Methodology (stat.ME)
Mediation analysis aims to decipher the underlying causal mechanisms between an exposure, an outcome, and intermediate variables called mediators. Initially developed for fixed-time mediator and outcome, it has been extended to the framework of longitudinal data by discretizing the assessment times of mediator and outcome. Yet, processes in play in longitudinal studies are usually defined in continuous time and measured at irregular and subject-specific visits. This is the case in dementia research when cerebral and cognitive changes measured at planned visits in cohorts are of interest. We thus propose a methodology to estimate the causal mechanisms between a time-fixed exposure ($X$), a mediator process ($\mathcal{M}_t$) and an outcome process ($\mathcal{Y}_t$) both measured repeatedly over time in the presence of a time-dependent confounding process ($\mathcal{L}_t$). We consider three types of causal estimands, the natural effects, path-specific effects and randomized interventional analogues to natural effects, and provide identifiability assumptions. We employ a dynamic multivariate model based on differential equations for their estimation. The performance of the methods is explored in simulations, and we illustrate the method in two real-world examples motivated by the 3C cerebral aging study to assess: (1) the effect of educational level on functional dependency through depressive symptomatology and cognitive functioning, and (2) the effect of a genetic factor on cognitive functioning potentially mediated by vascular brain lesions and confounded by neurodegeneration.
- [7] arXiv:2403.11163 [pdf, ps, other]
Title: A Selective Review on Statistical Methods for Massive Data Computation: Distributed Computing, Subsampling, and Minibatch Techniques
Authors: Xuetong Li, Yuan Gao, Hong Chang, Danyang Huang, Yingying Ma, Rui Pan, Haobo Qi, Feifei Wang, Shuyuan Wu, Ke Xu, Jing Zhou, Xuening Zhu, Yingqiu Zhu, Hansheng Wang
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
This paper presents a selective review of statistical computation methods for massive data analysis. A large number of statistical methods for massive data computation have been developed rapidly in the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature is about distributed computing and focuses on the situation where the dataset is too large to be comfortably handled by a single computer. In this case, a distributed computation system with multiple computers has to be used. The second class of literature is about subsampling methods and concerns the situation where the dataset is small enough to be stored on a single computer but too large to be easily processed in its memory as a whole. The last class of literature studies minibatch-gradient-related optimization techniques, which have been extensively used for optimizing various deep learning models.
- [8] arXiv:2403.11276 [pdf, other]
Title: Effects of model misspecification on small area estimators
Subjects: Methodology (stat.ME); Applications (stat.AP)
Nested error regression models are commonly used to incorporate observational unit specific auxiliary variables to improve small area estimates. When the mean structure of this model is misspecified, there is generally an increase in the mean square prediction error (MSPE) of Empirical Best Linear Unbiased Predictors (EBLUP). The Observed Best Prediction (OBP) method has been proposed with the intent to improve on the MSPE over EBLUP. We conduct a Monte Carlo simulation experiment to understand the effect of misspecification of mean structures on different small area estimators. Our simulations lead to the unexpected result that OBP may perform very poorly when observational unit level auxiliary variables are used and that OBP can be improved significantly when population means of those auxiliary variables (area level auxiliary variables) are used in the nested error regression model or when a corresponding area level model is used. Our simulation also indicates that the MSPE of OBP is an increasing function of the difference between the sample and population means of the auxiliary variables.
- [9] arXiv:2403.11356 [pdf, other]
Title: Multiscale Quantile Regression with Local Error Control
Comments: The implementation is in R package muscle, available at this https URL
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
For robust and efficient detection of change points, we introduce a novel methodology MUSCLE (multiscale quantile segmentation controlling local error) that partitions serial data into multiple segments, each sharing a common quantile. It leverages multiple tests for quantile changes over different scales and locations, and variational estimation. Unlike the often adopted global error control, MUSCLE focuses on local errors defined on individual segments, significantly improving detection power in finding change points. Meanwhile, due to the built-in model complexity penalty, it enjoys the finite sample guarantee that its false discovery rate (or the expected proportion of falsely detected change points) is upper bounded by its unique tuning parameter. Further, we obtain the consistency and the localisation error rates in estimating change points, under mild signal-to-noise-ratio conditions. Both match (up to log factors) the minimax optimality results in the Gaussian setup. All theories hold under the only distributional assumption of serial independence. Incorporating the wavelet tree data structure, we develop an efficient dynamic programming algorithm for computing MUSCLE. Extensive simulations as well as real data applications in electrophysiology and geophysics demonstrate its competitiveness and effectiveness. An implementation via R package muscle is available from GitHub.
- [10] arXiv:2403.11438 [pdf, other]
Title: Models of linkage error for capture-recapture estimation without clerical reviews
Comments: 42 pages
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
The capture-recapture method can be applied to measure the coverage of administrative and big data sources, in official statistics. In its basic form, it involves the linkage of two sources while assuming a perfect linkage and other standard assumptions. In practice, linkage errors arise and are a potential source of bias, where the linkage is based on quasi-identifiers. These errors include false positives and false negatives, where the former arise when linking a pair of records from different units, and the latter arise when not linking a pair of records from the same unit. So far, the existing solutions have resorted to costly clerical reviews, or they have made the restrictive conditional independence assumption. In this work, these requirements are relaxed by modeling the number of links from a record instead. The same approach may be taken to estimate the linkage accuracy without clerical reviews, when linking two sources that each have some undercoverage.
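The basic two-source setting described in [10] is often associated with the Lincoln-Petersen estimator, N = n1 * n2 / m, where m is the number of linked pairs. The sketch below uses that estimator to show how false negatives and false positives in the linkage could bias the population estimate. The abstract does not name an estimator, so the choice of Lincoln-Petersen, the function name, and all counts are assumptions made for illustration; this is not the model proposed in the paper.

```python
def lincoln_petersen(n1, n2, links):
    """Two-source population size estimate N = n1 * n2 / links."""
    if links <= 0:
        raise ValueError("at least one link is needed")
    return n1 * n2 / links


# Hypothetical sources: 800 and 500 records, 400 units truly present in both.
n1, n2, true_matches = 800, 500, 400

perfect = lincoln_petersen(n1, n2, true_matches)

# False negatives: 10% of the true matches are not linked, so m is too small.
missed = lincoln_petersen(n1, n2, true_matches - 40)

# False positives: 20 pairs from different units are linked, so m is too large.
extra = lincoln_petersen(n1, n2, true_matches + 20)

print(f"perfect linkage:       {perfect:.1f}")
print(f"with false negatives:  {missed:.1f}")
print(f"with false positives:  {extra:.1f}")
```

In this toy setting missed links inflate the estimate and spurious links deflate it, which is the direction of bias one would expect from the formula.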
- [11] arXiv:2403.11562 [pdf, other]
Title: A Comparison of Joint Species Distribution Models for Percent Cover Data
Subjects: Methodology (stat.ME)
1. Joint species distribution models (JSDMs) have gained considerable traction among ecologists over the past decade, due to their capacity to answer a wide range of questions at both the species- and the community-level. The family of generalized linear latent variable models in particular has proven popular for building JSDMs, being able to handle many response types including presence-absence data, biomass, overdispersed and/or zero-inflated counts.
2. We extend latent variable models to handle percent cover data, with vegetation, sessile invertebrate, and macroalgal cover data representing the prime examples of such data arising in community ecology.
3. Sparsity is a commonly encountered challenge with percent cover data. Responses are typically recorded as percentages covered per plot, though some species may be completely absent or present, i.e., have 0% or 100% cover respectively, rendering the use of the beta distribution inadequate.
4. We propose two JSDMs suitable for percent cover data, namely a hurdle beta model and an ordered beta model. We compare the two proposed approaches to a beta distribution for shifted responses, transformed presence-absence data, and an ordinal model for percent cover classes. Results demonstrate the hurdle beta JSDM was generally the most accurate at retrieving the latent variables and predicting ecological percent cover data.
- [12] arXiv:2403.11564 [pdf, other]
Title: Spatio-temporal point process intensity estimation using zero-deflated subsampling applied to a lightning strikes dataset in France
Authors: Jean-François Coeurjolly (LJK, SVH), Anne-Laure Fougères (ICJ, MODAL'X), Thibault Espinasse (PSPM, UCBL), Mathieu Ribatet (I3M)
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Cloud-to-ground lightning strikes observed in a specific geographical domain over time can be naturally modeled by a spatio-temporal point process. Our focus lies in the parametric estimation of its intensity function, incorporating both spatial factors (such as altitude) and spatio-temporal covariates (such as field temperature, precipitation, etc.). The events are observed in France over a span of three years. Spatio-temporal covariates are observed with resolution $0.1^\circ \times 0.1^\circ$ ($\approx 100$km$^2$) and six-hour periods. This results in an extensive dataset, further characterized by a significant excess of zeroes (i.e., spatio-temporal cells with no observed events). We reexamine composite likelihood methods commonly employed for spatial point processes, especially in situations where covariates are piecewise constant. Additionally, we extend these methods to account for zero-deflated subsampling, a strategy involving dependent subsampling, with a focus on selecting more cells in regions where events are observed. A simulation study is conducted to illustrate these novel methodologies, followed by their application to the dataset of lightning strikes.
- [13] arXiv:2403.11767 [pdf, other]
Title: Multiple testing in game-theoretic probability: pictures and questions
Authors: Vladimir Vovk
Comments: 19 pages, 6 figures
Subjects: Methodology (stat.ME)
The usual way of testing probability forecasts in game-theoretic probability is via construction of test martingales. The standard assumption is that all forecasts are output by the same forecaster. In this paper I will discuss possible extensions of this picture to testing probability forecasts output by several forecasters. This corresponds to multiple hypothesis testing in statistics. One interesting phenomenon is that even a slight relaxation of the requirement of family-wise validity leads to a very significant increase in the efficiency of testing procedures. The main goal of this paper is to report results of preliminary simulation studies and list some directions of further research.
- [14] arXiv:2403.11954 [pdf, other]
Title: Robust Estimation and Inference in Categorical Data
Authors: Max Welz
Comments: 63 pages, 7 figures, 6 tables
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Statistics Theory (math.ST)
In empirical science, many variables of interest are categorical. Like any model, models for categorical responses can be misspecified, leading to possibly large biases in estimation. One particularly troublesome source of misspecification is inattentive responding in questionnaires, which is well-known to jeopardize the validity of structural equation models (SEMs) and other survey-based analyses. I propose a general estimator that is designed to be robust to misspecification of models for categorical responses. Unlike existing approaches, the estimator makes no assumption whatsoever about the degree, magnitude, or type of misspecification. The proposed estimator generalizes maximum likelihood estimation, is strongly consistent, asymptotically Gaussian, has the same time complexity as maximum likelihood, and can be applied to any model for categorical responses. In addition, I develop a novel test that tests whether a given response can be fitted well by the assumed model, which allows one to trace back possible sources of misspecification. I verify the attractive theoretical properties of the proposed methodology in Monte Carlo experiments, and demonstrate its practical usefulness in an empirical application on a SEM of personality traits, where I find compelling evidence for the presence of inattentive responding whose adverse effects the proposed estimator can withstand, unlike maximum likelihood.
- [15] arXiv:2403.11983 [pdf, other]
Title: Proposal of a general framework to categorize continuous predictor variables
Subjects: Methodology (stat.ME)
The use of discretized variables in the development of prediction models is a common practice, in part because the decision-making process is more natural when it is based on rules created from segmented models. Although this practice is perhaps more common in medicine, it extends to any area of knowledge where a predictive model helps in decision-making. Therefore, providing researchers with a useful and valid categorization method could be a relevant issue when developing prediction models. In this paper, we propose a new general methodology that can be applied to categorize a predictor variable in any regression model where the distribution of the response variable belongs to the exponential family. Furthermore, it can be applied in any multivariate context, allowing more than one continuous covariate to be categorized simultaneously. In addition, a computationally very efficient method is proposed to obtain the optimal number of categories, based on a pseudo-BIC proposal. Several simulation studies have been conducted in which the efficiency of the method with respect to both the location and the number of estimated cut-off points is shown. Finally, the categorization proposal has been applied to a real data set of 543 patients with chronic obstructive pulmonary disease from Galdakao Hospital's five outpatient respiratory clinics, who were followed up for 10 years. We applied the proposed methodology to jointly categorize the continuous variables six-minute walking test and forced expiratory volume in one second in a multiple Poisson generalized additive model for the response variable rate of the number of hospital admissions by years of follow-up. The location and number of cut-off points obtained were clinically validated as being in line with the categorizations used in the literature.
Cross-lists for Tue, 19 Mar 24
- [16] arXiv:2401.01998 (cross-list from stat.AP) [pdf, other]
Title: A Corrected Score Function Framework for Modelling Circadian Gene Expression
Subjects: Applications (stat.AP); Methodology (stat.ME)
Many biological processes display oscillatory behavior based on an approximately 24 hour internal timing system specific to each individual. One process of particular interest is gene expression, for which several circadian transcriptomic studies have identified associations between gene expression during a 24 hour period and an individual's health. A challenge with analyzing data from these studies is that each individual's internal timing system is offset relative to the 24 hour day-night cycle, where day-night cycle time is recorded for each collected sample. Laboratory procedures can accurately determine each individual's offset and determine the internal time of sample collection. However, these laboratory procedures are labor-intensive and expensive. In this paper, we propose a corrected score function framework to obtain a regression model of gene expression given internal time when the offset of each individual is too burdensome to determine. A feature of this framework is that it does not require the probability distribution generating offsets to be symmetric with a mean of zero. Simulation studies validate the use of this corrected score function framework for cosinor regression, which is prevalent in circadian transcriptomic studies. Illustrations with three real circadian transcriptomic data sets further demonstrate that the proposed framework consistently mitigates bias relative to using a score function that does not account for this offset.
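To make the bias mentioned in [16] concrete, the sketch below fits a standard cosinor model, y = M + b1*cos(wt) + b2*sin(wt) with w = 2*pi/24, by ordinary least squares, once against internal time and once against clock time. It illustrates only the attenuation that could arise when individual offsets are ignored; it is not the corrected score function framework of the paper. The helper names, the noise-free data, and the five offsets are hypothetical.

```python
import math

OMEGA = 2 * math.pi / 24.0  # angular frequency of a 24-hour cycle


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def fit_cosinor(times, values):
    """Least-squares cosinor fit; returns (mesor, amplitude, acrophase)."""
    rows = [(1.0, math.cos(OMEGA * t), math.sin(OMEGA * t)) for t in times]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    mesor, b1, b2 = solve3(xtx, xty)
    return mesor, math.hypot(b1, b2), math.atan2(b2, b1)


def expression(internal_time, mesor=5.0, amplitude=2.0, acrophase=1.0):
    """Hypothetical noise-free expression level as a function of internal time."""
    return mesor + amplitude * math.cos(OMEGA * internal_time - acrophase)


clock_times = [2.0 * k for k in range(12)]  # samples every 2 hours
offsets = [-3.0, -1.5, 0.0, 1.5, 3.0]       # hypothetical individual offsets (hours)

clock, internal, y = [], [], []
for off in offsets:
    for t in clock_times:
        clock.append(t)
        internal.append(t + off)
        y.append(expression(t + off))

fit_internal = fit_cosinor(internal, y)  # uses each individual's internal time
fit_clock = fit_cosinor(clock, y)        # ignores the offsets

print("amplitude with internal time:", round(fit_internal[1], 4))
print("amplitude with clock time:   ", round(fit_clock[1], 4))
```

In this toy example the clock-time fit shrinks the amplitude by the average of cos(w * offset) over individuals, while the internal-time fit recovers the amplitude of 2.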
- [17] arXiv:2403.10567 (cross-list from cs.LG) [pdf, ps, other]
Title: Uncertainty estimation in spatial interpolation of satellite precipitation with ensemble learning
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Predictions in the form of probability distributions are crucial for decision-making. Quantile regression enables this within spatial interpolation settings for merging remote sensing and gauge precipitation data. However, ensemble learning of quantile regression algorithms remains unexplored in this context. Here, we address this gap by introducing nine quantile-based ensemble learners and applying them to large precipitation datasets. We employed a novel feature engineering strategy, reducing predictors to distance-weighted satellite precipitation at relevant locations, combined with location elevation. Our ensemble learners include six stacking and three simple methods (mean, median, best combiner), combining six individual algorithms: quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). These algorithms serve as both base learners and combiners within different stacking methods. We evaluated performance against QR using quantile scoring functions in a large dataset comprising 15 years of monthly gauge-measured and satellite precipitation in the contiguous US (CONUS). Stacking with QR and QRNN yielded the best results across quantile levels of interest (0.025, 0.050, 0.075, 0.100, 0.200, 0.300, 0.400, 0.500, 0.600, 0.700, 0.800, 0.900, 0.925, 0.950, 0.975), surpassing the reference method by 3.91% to 8.95%. This demonstrates the potential of stacking to improve probabilistic predictions in spatial interpolation and beyond.
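The quantile scoring function referred to in [17] is commonly taken to be the pinball loss, which the sketch below implements for a single quantile level. Whether the paper uses exactly this form is an assumption, and the function names and toy data are hypothetical.

```python
def pinball_loss(y, q, tau):
    """Quantile (pinball) loss of prediction q for observation y at level tau."""
    diff = y - q
    return max(tau * diff, (tau - 1.0) * diff)


def mean_pinball(ys, q, tau):
    """Average pinball loss of a constant prediction q over observations ys."""
    return sum(pinball_loss(y, q, tau) for y in ys) / len(ys)


# Hypothetical skewed sample: median is 2, mean is 3.2.
obs = [0.0, 1.0, 2.0, 3.0, 10.0]

score_median = mean_pinball(obs, 2.0, tau=0.5)
score_mean = mean_pinball(obs, 3.2, tau=0.5)

print("score when predicting the median:", score_median)
print("score when predicting the mean:  ", score_mean)
```

Lower scores are better. At level 0.5 the sample median scores lower than the sample mean, and at high levels such as 0.9 under-prediction is penalized more heavily than over-prediction.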
- [18] arXiv:2403.10618 (cross-list from cs.LG) [pdf, ps, other]
Title: Limits of Approximating the Median Treatment Effect
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Econometrics (econ.EM); Methodology (stat.ME)
Average Treatment Effect (ATE) estimation is a well-studied problem in causal inference. However, it does not necessarily capture the heterogeneity in the data, and several approaches have been proposed to tackle the issue, including estimating the Quantile Treatment Effects. In the finite population setting containing $n$ individuals, with treatment and control values denoted by the potential outcome vectors $\mathbf{a}, \mathbf{b}$, much of the prior work focused on estimating median$(\mathbf{a}) -$ median$(\mathbf{b})$, where median($\mathbf x$) denotes the median value in the sorted ordering of all the values in vector $\mathbf x$. It is known that estimating the difference of medians is easier than the desired estimand of median$(\mathbf{a-b})$, called the Median Treatment Effect (MTE). The fundamental problem of causal inference (for every individual $i$, we can only observe one of the potential outcome values, i.e., either $a_i$ or $b_i$, but not both) makes estimating the MTE particularly challenging. In this work, we argue that MTE is not estimable and detail a novel notion of approximation that relies on the sorted order of the values in $\mathbf{a-b}$. Next, we identify a quantity called variability that exactly captures the complexity of MTE estimation. By drawing connections to instance-optimality studied in theoretical computer science, we show that every algorithm for estimating the MTE obtains an approximation error that is no better than the error of an algorithm that computes variability. Finally, we provide a simple linear time algorithm for computing the variability exactly. Unlike much prior work, a particular highlight of our work is that we make no assumptions about how the potential outcome vectors are generated or how they are correlated, except that the potential outcome values are $k$-ary, i.e., take one of $k$ discrete values.
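The distinction in [18] between the difference of medians and the median of differences can be seen on a three-person toy population. The sketch below is only an illustration with hypothetical potential outcome vectors; it does not implement the paper's approximation notion or its variability quantity.

```python
from statistics import median

# Hypothetical potential outcomes for three individuals (values are 3-ary).
a = [1, 2, 3]          # outcomes under treatment
b = [3, 1, 2]          # outcomes under control
b_repaired = [1, 2, 3] # same control values, paired with individuals differently

# Difference of medians depends only on the two marginal distributions.
diff_of_medians = median(a) - median(b)

# Median treatment effect depends on how a_i and b_i are paired per individual.
mte = median(x - y for x, y in zip(a, b))
mte_repaired = median(x - y for x, y in zip(a, b_repaired))

print("median(a) - median(b):", diff_of_medians)
print("median(a - b):        ", mte)
print("median(a - b) after re-pairing b:", mte_repaired)
```

Both pairings have identical marginals, so the difference of medians is 0 in each case, yet the median of individual differences is 1 for the first pairing and 0 for the second. Since only one of $a_i$ or $b_i$ is ever observed, the pairing itself is not observable.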
- [19] arXiv:2403.10766 (cross-list from cs.LG) [pdf, other]
Title: ODE Discovery for Longitudinal Heterogeneous Treatment Effects Inference
Comments: Published in The Twelfth International Conference on Learning Representations (ICLR). Copyright 2024 by the author(s)
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Inferring unbiased treatment effects has received widespread attention in the machine learning community. In recent years, our community has proposed numerous solutions in standard settings, high-dimensional treatment settings, and even longitudinal settings. While very diverse, these solutions have mostly relied on neural networks for inference and simultaneous correction of assignment bias. New approaches typically build on top of previous approaches by proposing new (or refined) architectures and learning algorithms. However, the end result -- a neural-network-based inference machine -- remains unchallenged. In this paper, we introduce a different type of solution in the longitudinal setting: a closed-form ordinary differential equation (ODE). While we still rely on continuous optimization to learn an ODE, the resulting inference machine is no longer a neural network. Doing so yields several advantages such as interpretability, irregular sampling, and a different set of identification assumptions. Above all, we consider the introduction of a completely new type of solution to be our most important contribution as it may spark entirely new innovations in treatment effects in general. We facilitate this by formulating our contribution as a framework that can transform any ODE discovery method into a treatment effects method.
- [20] arXiv:2403.11332 (cross-list from cs.LG) [pdf, other]
Title: Graph Neural Network based Double Machine Learning Estimator of Network Causal Effects
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Methodology (stat.ME)
Our paper addresses the challenge of inferring causal effects in social network data, characterized by complex interdependencies among individuals resulting in challenges such as non-independence of units, interference (where a unit's outcome is affected by neighbors' treatments), and introduction of additional confounding factors from neighboring units. We propose a novel methodology combining graph neural networks and double machine learning, enabling accurate and efficient estimation of direct and peer effects using a single observational social network. Our approach utilizes graph isomorphism networks in conjunction with double machine learning to effectively adjust for network confounders and consistently estimate the desired causal effects. We demonstrate that our estimator is both asymptotically normal and semiparametrically efficient. A comprehensive evaluation against four state-of-the-art baseline methods using three semi-synthetic social network datasets reveals our method's on-par or superior efficacy in precise causal effect estimation. Further, we illustrate the practical application of our method through a case study that investigates the impact of Self-Help Group participation on financial risk tolerance. The results indicate a significant positive direct effect, underscoring the potential of our approach in social network analysis. Additionally, we explore the effects of network sparsity on estimation performance.
- [21] arXiv:2403.11343 (cross-list from cs.LG) [pdf, other]
Title: Federated Transfer Learning with Differential Privacy
Comments: 76 pages, 3 figures
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
Federated learning is gaining increasing popularity, with data heterogeneity and privacy being two prominent challenges. In this paper, we address both issues within a federated transfer learning framework, aiming to enhance learning on a target data set by leveraging information from multiple heterogeneous source data sets while adhering to privacy constraints. We rigorously formulate the notion of federated differential privacy, which offers privacy guarantees for each data set without assuming a trusted central server. Under this privacy constraint, we study three classical statistical problems, namely univariate mean estimation, low-dimensional linear regression, and high-dimensional linear regression. By investigating the minimax rates and identifying the costs of privacy for these problems, we show that federated differential privacy is an intermediate privacy model between the well-established local and central models of differential privacy. Our analyses incorporate data heterogeneity and privacy, highlighting the fundamental costs of both in federated learning and underscoring the benefit of knowledge transfer across data sets.
Replacements for Tue, 19 Mar 24
- [22] arXiv:1805.05606 (replaced) [pdf, other]
-
Title: Nonparametric Bayesian volatility learning under microstructure noise
Comments: 22 pages, 9 figures
Journal-ref: Jpn. J. Stat. Data. Sci 6, 551-571 (2023)
Subjects: Methodology (stat.ME); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
- [23] arXiv:2203.06056 (replaced) [pdf, ps, other]
-
Title: Identifying Causal Effects using Instrumental Time Series: Nuisance IV and Correcting for the Past
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
- [24] arXiv:2206.12113 (replaced) [pdf, other]
-
Title: Sequential adaptive design for emulating costly computer codes
Subjects: Methodology (stat.ME)
- [25] arXiv:2206.12525 (replaced) [pdf, other]
-
Title: Causality for Complex Continuous-time Functional Longitudinal Studies
Authors: Andrew Ying
Subjects: Methodology (stat.ME); Probability (math.PR); Statistics Theory (math.ST)
- [26] arXiv:2207.00530 (replaced) [pdf, ps, other]
-
Title: The Target Study: A Conceptual Model and Framework for Measuring Disparity
Comments: Completely re-written for a clearer and more formal presentation with added results for generalizability and transportability and a more detailed comparison to alternative models
Subjects: Methodology (stat.ME)
- [27] arXiv:2210.06927 (replaced) [pdf, other]
-
Title: Prediction can be safely used as a proxy for explanation in causally consistent Bayesian generalized linear models
Subjects: Methodology (stat.ME)
- [28] arXiv:2211.01938 (replaced) [pdf, other]
-
Title: A family of mixture models for beta valued DNA methylation data
Authors: Koyel Majumdar, Romina Silva, Antoinette Sabrina Perry, Ronald William Watson, Andrea Rau, Florence Jaffrezic, Thomas Brendan Murphy, Isobel Claire Gormley
Comments: 27 pages, 4 figures
Subjects: Methodology (stat.ME)
- [29] arXiv:2211.16059 (replaced) [pdf, ps, other]
-
Title: On Large-Scale Multiple Testing Over Networks: An Asymptotic Approach
Comments: Published in the IEEE Transactions on Signal and Information Processing over Networks
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
- [30] arXiv:2212.01900 (replaced) [pdf, other]
-
Title: Bayesian survival analysis with INLA
Subjects: Methodology (stat.ME)
- [31] arXiv:2302.13133 (replaced) [pdf, other]
-
Title: Data-driven uncertainty quantification for constrained stochastic differential equations and application to solar photovoltaic power forecast data
Comments: 30 pages, 20 figures
Subjects: Methodology (stat.ME)
- [32] arXiv:2303.08528 (replaced) [pdf, other]
-
Title: Translating predictive distributions into informative priors
Comments: Revised to shorten the main text considerably
Subjects: Methodology (stat.ME)
- [33] arXiv:2303.10215 (replaced) [pdf, other]
-
Title: Statistical inference for association studies in the presence of binary outcome misclassification
Comments: 58 pages, 5 figures
Subjects: Methodology (stat.ME)
- [34] arXiv:2303.15158 (replaced) [pdf, other]
-
Title: Discovering the Network Granger Causality in Large Vector Autoregressive Models
Subjects: Methodology (stat.ME)
- [35] arXiv:2308.05577 (replaced) [pdf, other]
-
Title: Optimal Designs for Two-Stage Inference
Subjects: Methodology (stat.ME)
- [36] arXiv:2311.03829 (replaced) [pdf, ps, other]
-
Title: Multilevel mixtures of latent trait analyzers for clustering multi-layer bipartite networks
Comments: A version of the manuscript is in production at Multivariate Behavioral Research
Subjects: Methodology (stat.ME)
- [37] arXiv:2402.15086 (replaced) [pdf, other]
-
Title: A modified debiased inverse-variance weighted estimator in two-sample summary-data Mendelian randomization
Comments: 33 pages, 6 figures
Subjects: Methodology (stat.ME)
- [38] arXiv:2402.15705 (replaced) [pdf, other]
-
Title: A Variational Approach for Modeling High-dimensional Spatial Generalized Linear Mixed Models
Comments: 34 Pages for the main paper, 72 pages for the supplemental information, 4 tables, 5 figures
Subjects: Methodology (stat.ME)
- [39] arXiv:2306.14311 (replaced) [pdf, ps, other]
-
Title: Simple Estimation of Semiparametric Models with Measurement Errors
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
- [40] arXiv:2306.14862 (replaced) [pdf, other]
-
Title: Marginal Effects for Probit and Tobit with Endogeneity
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
- [41] arXiv:2312.12741 (replaced) [pdf, other]
-
Title: Locally Optimal Fixed-Budget Best Arm Identification in Two-Armed Gaussian Bandits with Unknown Variances
Authors: Masahiro Kato
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)