New submissions

[ total of 30 entries: 1-30 ]

New submissions for Tue, 19 Oct 21

[1]  arXiv:2110.08410 [pdf, ps, other]
Title: Covariate Adjustment in Regression Discontinuity Designs
Subjects: Methodology (stat.ME); Econometrics (econ.EM)

The Regression Discontinuity (RD) design is a widely used non-experimental method for causal inference and program evaluation. While its canonical formulation only requires a score and an outcome variable, it is common in empirical work to encounter RD implementations where additional variables are used for adjustment. This practice has led to misconceptions about the role of covariate adjustment in RD analysis, from both methodological and empirical perspectives. In this chapter, we review the different roles of covariate adjustment in RD designs, and offer methodological guidance for its correct use in applications.

[2]  arXiv:2110.08411 [pdf, other]
Title: Multi-group Gaussian Processes
Subjects: Methodology (stat.ME); Applications (stat.AP)

Gaussian processes (GPs) are pervasive in functional data analysis, machine learning, and spatial statistics for modeling complex dependencies. Modern scientific data sets are typically heterogeneous and often contain multiple known discrete subgroups of samples. For example, in genomics applications samples may be grouped according to tissue type or drug exposure. In the modeling process it is desirable to leverage the similarity among groups while accounting for differences between them. While a substantial literature exists for GPs over Euclidean domains $\mathbb{R}^p$, GPs on domains suitable for multi-group data remain less explored. Here, we develop a multi-group Gaussian process (MGGP), which we define on $\mathbb{R}^p\times \mathscr{C}$, where $\mathscr{C}$ is a finite set representing the group label. We provide general methods to construct valid (positive definite) covariance functions on this domain, and we describe algorithms for inference, estimation, and prediction. We perform simulation experiments and apply MGGP to gene expression data to illustrate the behavior and advantages of the MGGP in the joint modeling of continuous and categorical variables.

[3]  arXiv:2110.08425 [pdf, other]
Title: Exact Bias Correction for Linear Adjustment of Randomized Controlled Trials
Subjects: Methodology (stat.ME); Econometrics (econ.EM)

In an influential critique of empirical practice, Freedman (2008a, 2008b) showed that the linear regression estimator was biased for the analysis of randomized controlled trials under the randomization model. Under Freedman's assumptions, we derive exact closed-form bias corrections for the linear regression estimator with and without treatment-by-covariate interactions. We show that the limiting distribution of the bias corrected estimator is identical to the uncorrected estimator, implying that the asymptotic gains from adjustment can be attained without introducing any risk of bias. Taken together with results from Lin (2013), our results show that Freedman's theoretical arguments against the use of regression adjustment can be completely resolved with minor modifications to practice.

[4]  arXiv:2110.08570 [pdf, other]
Title: A Reduced-Bias Weighted least square estimation of the Extreme Value Index
Comments: 24 pages
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)

In this paper, we propose a reduced-bias estimator of the extreme value index (EVI) for Pareto-type (heavy-tailed) distributions. The estimator is derived using the weighted least squares method. It is shown that the estimator is unbiased, consistent and asymptotically normal under the second-order conditions on the underlying distribution of the data. The finite sample properties of the proposed estimator are studied through a simulation study. The results show that it is competitive with the existing estimators of the extreme value index in terms of bias and Mean Square Error. In addition, it yields estimates of $\gamma>0$ that are less sensitive to the number of top-order statistics, and hence, can be used for selecting an optimal tail fraction. The proposed estimator is further illustrated using practical pedochemical and insurance datasets.

[5]  arXiv:2110.08665 [pdf, other]
Title: Quantile Regression by Dyadic CART
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)

In this paper we propose and study a version of the Dyadic Classification and Regression Trees (DCART) estimator from Donoho (1997) for (fixed design) quantile regression in general dimensions. We refer to this proposed estimator as the QDCART estimator. Just like the mean regression version, we show that a) a fast dynamic programming based algorithm with computational complexity $O(N \log N)$ exists for computing the QDCART estimator and b) an oracle risk bound (trading off squared error and a complexity parameter of the true signal) holds for the QDCART estimator. This oracle risk bound then allows us to demonstrate that the QDCART estimator enjoys adaptively rate optimal estimation guarantees for piecewise constant and bounded variation function classes. In contrast to existing results for the DCART estimator, which require subgaussianity of the error distribution, our estimation guarantees do not need any restrictive tail decay assumptions on the error distribution. For instance, our results hold even when the error distribution has no first moment, such as the Cauchy distribution. Apart from the Dyadic CART method, we also consider other variant methods such as the Optimal Regression Tree (ORT) estimator introduced in Chatterjee and Goswami (2019). In particular, we also extend the ORT estimator to the quantile setting and establish that it enjoys analogous guarantees. Thus, this paper extends the scope of these globally optimal regression tree based methodologies to be applicable for heavy tailed data. We then perform extensive numerical experiments on both simulated and real data which illustrate the usefulness of the proposed methods.

[6]  arXiv:2110.08747 [pdf, ps, other]
Title: JEL ratio test for independence of time to failure and cause of failure in competing risks
Subjects: Methodology (stat.ME)

In the present article, we propose a jackknife empirical likelihood (JEL) ratio test for testing the independence of time to failure and cause of failure in competing risks data. We use U-statistic theory to derive the JEL ratio test. The asymptotic distribution of the test statistic is shown to be a chi-square distribution with one degree of freedom. A Monte Carlo simulation study is carried out to assess the finite sample behaviour of the proposed test. The performance of the proposed JEL test is compared with the test given in Dewan et al. (2004). Finally, we illustrate our test procedure using various real data sets.
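
As a rough illustration of the building block behind a JEL test, the sketch below computes jackknife pseudo-values of a degree-two U-statistic; jackknife empirical likelihood then treats such pseudo-values as approximately independent observations. The kernel used here (absolute difference) and the function names `u_statistic` and `jackknife_pseudo_values` are assumptions chosen for illustration, not the statistic or code from the listed paper.

```python
from itertools import combinations


def u_statistic(xs, kernel):
    """Average of a symmetric two-argument kernel over all unordered pairs."""
    pairs = list(combinations(xs, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def jackknife_pseudo_values(xs, kernel):
    """Pseudo-values V_i = n*T_n - (n-1)*T_{n-1}^{(-i)} for a U-statistic T."""
    n = len(xs)
    full = u_statistic(xs, kernel)
    return [
        n * full - (n - 1) * u_statistic(xs[:i] + xs[i + 1:], kernel)
        for i in range(n)
    ]


def abs_diff(a, b):
    # Hypothetical example kernel (Gini mean difference); not from the paper.
    return abs(a - b)
```

For a U-statistic the pseudo-values average back to the statistic itself, which gives a quick sanity check on an implementation.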

[7]  arXiv:2110.08970 [pdf, other]
Title: Sample size calculations for n-of-1 trials
Subjects: Methodology (stat.ME); Applications (stat.AP)

N-of-1 trials, single participant trials in which multiple treatments are sequentially randomized over the study period, can give direct estimates of individual-specific treatment effects. Combining n-of-1 trials gives extra information for estimating the population average treatment effect compared with randomized controlled trials and increases precision for individual-specific treatment effect estimates. In this paper, we present a procedure for designing n-of-1 trials. We formally define the design components for determining the sample size of a series of n-of-1 trials, present models for analyzing these trials and use them to derive the sample size formula for estimating the population average treatment effect and the standard error of the individual-specific treatment effect estimates. We recommend first finding the possible designs that will satisfy the power requirement for estimating the population average treatment effect and then, if of interest, finalizing the design to also satisfy the standard error requirements for the individual-specific treatment effect estimates. The procedure is implemented and illustrated in the paper and through a Shiny app.

[8]  arXiv:2110.09040 [pdf, ps, other]
Title: A Bayesian approach to multi-task learning with network lasso
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Network lasso is a method for solving a multi-task learning problem through the regularized maximum likelihood method. A characteristic of network lasso is setting a different model for each sample. The relationships among the models are represented by relational coefficients. A crucial issue in network lasso is to provide appropriate values for these relational coefficients. In this paper, we propose a Bayesian approach to solve multi-task learning problems by network lasso. This approach allows us to objectively determine the relational coefficients by Bayesian estimation. The effectiveness of the proposed method is shown in a simulation study and a real data analysis.

[9]  arXiv:2110.09115 [pdf, other]
Title: Optimal designs for experiments for scalar-on-function linear models
Subjects: Methodology (stat.ME)

The aim of this work is to extend the usual optimal experimental design paradigm to experiments where the settings of one or more factors are functions. For these new experiments, a design consists of combinations of functions for each run of the experiment along with settings for non-functional variables. After briefly introducing the class of functional variables, basis function systems are described. Basis function expansion is applied to a functional linear model consisting of both functional and scalar factors, reducing the problem to an optimisation problem of a single design matrix.

[10]  arXiv:2110.09143 [pdf, other]
Title: Variance Reduction in Stochastic Reaction Networks using Control Variates
Comments: arXiv admin note: substantial text overlap with arXiv:1905.00854
Subjects: Methodology (stat.ME); Systems and Control (eess.SY); Molecular Networks (q-bio.MN); Quantitative Methods (q-bio.QM)

Monte Carlo estimation plays a crucial role in stochastic reaction networks. However, reducing the statistical uncertainty of the corresponding estimators requires sampling a large number of trajectories. We propose control variates based on the statistical moments of the process to reduce the estimators' variances. We develop an algorithm that selects an efficient subset of infinitely many control variates. To this end, the algorithm uses resampling and a redundancy-aware greedy selection. We demonstrate the efficiency of our approach in several case studies.
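
The general control-variate idea can be sketched on a toy problem, assuming a single control variate with known mean. The example below estimates E[exp(U)] for U uniform on (0, 1), using U itself (mean 0.5) as the control. It is not the moment-based construction or the selection algorithm of the listed paper, and the function name `control_variate_demo` is hypothetical.

```python
import math
import random


def control_variate_demo(n, seed=0):
    """Estimate E[exp(U)], U ~ Uniform(0, 1), with and without a control variate."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(n)]
    ys = [math.exp(u) for u in us]

    y_bar = sum(ys) / n
    u_bar = sum(us) / n

    # Estimated optimal coefficient: beta = Cov(Y, U) / Var(U).
    var_u = sum((u - u_bar) ** 2 for u in us) / (n - 1)
    cov_yu = sum((y - y_bar) * (u - u_bar) for y, u in zip(ys, us)) / (n - 1)
    beta = cov_yu / var_u

    # Subtract the centred control; its known mean is E[U] = 0.5.
    adjusted = [y - beta * (u - 0.5) for y, u in zip(ys, us)]
    adj_bar = sum(adjusted) / n

    var_plain = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    var_adj = sum((a - adj_bar) ** 2 for a in adjusted) / (n - 1)
    return y_bar, adj_bar, var_plain, var_adj
```

The adjusted samples have a smaller per-sample variance than the raw ones whenever the control is correlated with the quantity of interest, so fewer trajectories are needed for the same precision.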

[11]  arXiv:2110.09275 [pdf, ps, other]
Title: Double Robust Mass-Imputation with Matching Estimators
Authors: Ali Furkan Kalay
Subjects: Methodology (stat.ME)

This paper proposes using a method named Double Score Matching (DSM) to do mass-imputation and presents an application to make inferences with a nonprobability sample. DSM is a $k$-Nearest Neighbors algorithm that uses two balance scores instead of covariates to reduce the dimension of the distance metric and thus to achieve a faster convergence rate. DSM mass-imputation and population inference are consistent if one of two balance score models is correctly specified. Simulation results show that the DSM performs better than recently developed double robust estimators when the data generating process has nonlinear confounders. The nonlinearity of the DGP is a major concern because it cannot be tested, and it leads to a violation of the assumptions required to achieve consistency. Even if the consistency of the DSM relies on the two modeling assumptions, it prevents bias from inflating under such cases because DSM is a semiparametric estimator. The confidence intervals are constructed using a wild bootstrapping approach. The proposed bootstrapping method generates valid confidence intervals as long as DSM is consistent.

[12]  arXiv:2110.09382 [pdf, other]
Title: Frequentist-Bayes Hybrid Covariance Estimation for Unfolding Problems
Subjects: Methodology (stat.ME); High Energy Physics - Experiment (hep-ex)

In this paper we present a frequentist-Bayesian hybrid method for estimating covariances of unfolded distributions using pseudo-experiments. The method is compared with other covariance estimation methods using the unbiased Rao-Cramer bound (RCB) and frequentist pseudo-experiments. We show that the unbiased RCB method diverges from the other two methods when regularization is introduced. The new hybrid method agrees well with the frequentist pseudo-experiment method for various amounts of regularization. However, the hybrid method has the added advantage of not requiring a clear likelihood definition and can be used in combination with any unfolding algorithm that uses a response matrix to model the detector response.

Cross-lists for Tue, 19 Oct 21

[13]  arXiv:2110.08331 (cross-list from cs.LG) [pdf, other]
Title: A New Approach for Interpretability and Reliability in Clinical Risk Prediction: Acute Coronary Syndrome Scenario
Comments: Accepted for publication in the Artificial Intelligence in Medicine journal. Abstract abridged to respect the arXiv's characters limit
Journal-ref: Artificial Intelligence in Medicine, Volume 117, 2021
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)

We intend to create a new risk assessment methodology that combines the best characteristics of both risk score and machine learning models. More specifically, we aim to develop a method that, besides having a good performance, offers a personalized model and outcome for each patient, presents high interpretability, and incorporates an estimation of the prediction reliability which is not usually available. By combining these features in the same approach we expect that it can boost the confidence of physicians to use such a tool in their daily activity. In order to achieve the mentioned goals, a three-step methodology was developed: several rules were created by dichotomizing risk factors; such rules were trained with a machine learning classifier to predict the acceptance degree of each rule (the probability that the rule is correct) for each patient; that information was combined and used to compute the risk of mortality and the reliability of such prediction. The methodology was applied to a dataset of patients admitted with any type of acute coronary syndrome (ACS), to assess the 30-day all-cause mortality risk. The performance was compared with state-of-the-art approaches: logistic regression (LR), artificial neural network (ANN), and clinical risk score model (Global Registry of Acute Coronary Events - GRACE). The proposed approach achieved testing results identical to the standard LR, but offers superior interpretability and personalization; it also significantly outperforms the GRACE risk model and the standard ANN model. The calibration curve also suggests a very good generalization ability of the obtained model as it approaches the ideal curve. Finally, the reliability estimation of individual predictions showed a strong correlation with the misclassification rate. Those properties may have a beneficial application in other clinical scenarios as well. [abridged]

[14]  arXiv:2110.08505 (cross-list from stat.ML) [pdf, other]
Title: Mode and Ridge Estimation in Euclidean and Directional Product Spaces: A Mean Shift Approach
Comments: 51 pages, 10 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

The set of local modes and the ridge lines estimated from a dataset are important summary characteristics of the data-generating distribution. In this work, we consider estimating the local modes and ridges from point cloud data in a product space with two or more Euclidean/directional metric spaces. Specifically, we generalize the well-known (subspace constrained) mean shift algorithm to the product space setting and illuminate some pitfalls in such generalization. We derive the algorithmic convergence of the proposed method, provide practical guidelines on the implementation, and demonstrate its effectiveness on both simulated and real datasets.

[15]  arXiv:2110.08884 (cross-list from stat.ML) [pdf, other]
Title: Persuasion by Dimension Reduction
Comments: arXiv admin note: text overlap with arXiv:2102.10909
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); General Economics (econ.GN); Statistics Theory (math.ST); Methodology (stat.ME)

How should an agent (the sender) observing multi-dimensional data (the state vector) persuade another agent to take the desired action? We show that it is always optimal for the sender to perform a (non-linear) dimension reduction by projecting the state vector onto a lower-dimensional object that we call the "optimal information manifold." We characterize geometric properties of this manifold and link them to the sender's preferences. Optimal policy splits information into "good" and "bad" components. When the sender's marginal utility is linear, revealing the full magnitude of good information is always optimal. In contrast, with concave marginal utility, optimal information design conceals the extreme realizations of good information and only reveals its direction (sign). We illustrate these effects by explicitly solving several multi-dimensional Bayesian persuasion problems.

[16]  arXiv:2110.08905 (cross-list from stat.AP) [pdf, other]
Title: Exploitation of error correlation in a large analysis validation: GlobCurrent case study
Comments: 24 pages, 14 figures
Journal-ref: Remote Sens. Environ., 217, 476-490 (2018)
Subjects: Applications (stat.AP); Statistics Theory (math.ST); Methodology (stat.ME)

An assessment of variance in ocean current signal and noise shared by in situ observations (drifters) and a large gridded analysis (GlobCurrent) is sought as a function of day of the year for 1993-2015 and across a broad spectrum of current speed. Regardless of the division of collocations, it is difficult to claim that any synoptic assessment can be based on independent observations. Instead, a measurement model that departs from ordinary linear regression by accommodating error correlation is proposed. The interpretation of independence is explored by applying Fuller's (1987) concept of equation and measurement error to a division of error into shared (correlated) and unshared (uncorrelated) components, respectively. The resulting division of variance in the new model favours noise. Ocean current shared (equation) error is of comparable magnitude to unshared (measurement) error and the latter is, for GlobCurrent and drifters respectively, comparable to ordinary and reverse linear regression. Although signal variance appears to be small, its utility as a measure of agreement between two variates is highlighted.
Sparse collocations that sample a dense grid permit a first order autoregressive form of measurement model to be considered, including parameterizations of analysis-in situ error cross-correlation and analysis temporal error autocorrelation. The former (cross-correlation) is an equation error term that accommodates error shared by both GlobCurrent and drifters. The latter (autocorrelation) facilitates an identification and retrieval of all model parameters. Solutions are sought using a prescribed calibration between GlobCurrent and drifters (by variance matching). Because the true current variance of GlobCurrent and drifters is small, signal to noise ratio is near zero at best. This is particularly evident for moderate current speed and meridional current component.

[17]  arXiv:2110.08969 (cross-list from stat.AP) [pdf, ps, other]
Title: On completing a measurement model by symmetry
Comments: 4 pages
Subjects: Applications (stat.AP); Statistics Theory (math.ST); Methodology (stat.ME)

An appeal for symmetry is made to build established notions of specific representation and specific nonlinearity of measurement (often called model error) into a canonical linear regression model. Additive components are derived from the trivially complete model M = m. Factor analysis and equation error motivate corresponding notions of representation and nonlinearity in an errors-in-variables framework, with a novel interpretation of terms. It is suggested that a modern interpretation of correlation involves both linear and nonlinear association.

[18]  arXiv:2110.09192 (cross-list from cs.LG) [pdf, other]
Title: Learning Optimal Conformal Classifiers
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME); Machine Learning (stat.ML)

Modern deep learning based classifiers show very high accuracy on test data but this does not provide sufficient guarantees for safe deployment, especially in high-stakes AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's probability estimates to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training the model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. We show that ConfTr outperforms state-of-the-art CP methods for classification by reducing the average confidence set size (inefficiency). Moreover, it allows one to "shape" the confidence sets predicted at test time, which is difficult for standard CP. In experiments with several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.
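
The post-training CP step described in this abstract can be sketched with a standard split (threshold) conformal predictor, assuming a held-out calibration set of predicted class probabilities and true labels. This illustrates CP as a separate processing step only, not the ConfTr training procedure; the function names `conformal_threshold` and `prediction_set` are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold targeting coverage of at least 1 - alpha.

    The nonconformity score is 1 - (predicted probability of the true class).
    """
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return 1.0  # too few calibration points: every class is included
    return scores[k - 1]


def prediction_set(probs, threshold):
    """Return all class indices whose score does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]
```

Under exchangeability of calibration and test points, sets built this way contain the true class with probability at least 1 - alpha on average; the set size is what the abstract calls inefficiency.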

Replacements for Tue, 19 Oct 21

[19]  arXiv:1911.09171 (replaced) [pdf, other]
Title: Re-Evaluating Strengthened-IV Designs: Asymptotic Efficiency, Bias Formula, and the Validity and Power of Sensitivity Analyses
Comments: 86 pages, 4 figures, 6 tables
Subjects: Methodology (stat.ME); Applications (stat.AP)
[20]  arXiv:2005.12556 (replaced) [pdf, other]
Title: Truncating the Exponential with a Uniform Distribution
Subjects: Methodology (stat.ME)
[21]  arXiv:2007.14190 (replaced) [pdf, other]
Title: Variable Selection for Doubly Robust Causal Inference
Subjects: Methodology (stat.ME)
[22]  arXiv:2012.11026 (replaced) [pdf]
Title: Independent Approximates enable closed-form parameter estimation of heavy-tailed distributions
Authors: Kenric P. Nelson
Comments: 30 pages, 8 figures, 7 tables
Subjects: Methodology (stat.ME); Information Theory (cs.IT); Data Analysis, Statistics and Probability (physics.data-an)
[23]  arXiv:2103.01621 (replaced) [pdf, other]
Title: Fast selection of nonlinear mixed effect models using penalized likelihood
Authors: Edouard Ollier
Subjects: Methodology (stat.ME); Computation (stat.CO)
[24]  arXiv:2104.07084 (replaced) [pdf, other]
Title: Grouped Variable Selection with Discrete Optimization: Computational and Statistical Perspectives
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
[25]  arXiv:2109.11307 (replaced) [pdf, other]
Title: Semiparametric bivariate extreme-value copulas
Comments: 23 pages, 22 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Computation (stat.CO)
[26]  arXiv:2110.04433 (replaced) [pdf, ps, other]
Title: De-biased Lasso for Generalized Linear Models with A Diverging Number of Covariates
Authors: Lu Xia, Bin Nan, Yi Li
Comments: arXiv admin note: text overlap with arXiv:2006.12778
Subjects: Methodology (stat.ME)
[27]  arXiv:2106.09769 (replaced) [pdf, other]
Title: Generalized regression operator estimation for continuous time functional data processes with missing at random response
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
[28]  arXiv:2106.10624 (replaced) [pdf]
Title: Combined tests based on restricted mean time lost for competing risks data
Comments: 26 pages, 3 figures
Journal-ref: Statistics in Biopharmaceutical Research, 2021
Subjects: Applications (stat.AP); Methodology (stat.ME)
[29]  arXiv:2110.01571 (replaced) [pdf, other]
Title: Learning Causal Representation for Face Transfer across Large Appearance Gap
Subjects: Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
[30]  arXiv:2110.01593 (replaced) [pdf, other]
Title: Generalized Kernel Thinning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)