Methodology
New submissions
New submissions for Tue, 19 Oct 21
 [1] arXiv:2110.08410 [pdf, ps, other]

Title: Covariate Adjustment in Regression Discontinuity Designs
Subjects: Methodology (stat.ME); Econometrics (econ.EM)
The Regression Discontinuity (RD) design is a widely used nonexperimental method for causal inference and program evaluation. While its canonical formulation only requires a score and an outcome variable, it is common in empirical work to encounter RD implementations where additional variables are used for adjustment. This practice has led to misconceptions about the role of covariate adjustment in RD analysis, from both methodological and empirical perspectives. In this chapter, we review the different roles of covariate adjustment in RD designs, and offer methodological guidance for its correct use in applications.
 [2] arXiv:2110.08411 [pdf, other]

Title: Multigroup Gaussian Processes
Subjects: Methodology (stat.ME); Applications (stat.AP)
Gaussian processes (GPs) are pervasive in functional data analysis, machine learning, and spatial statistics for modeling complex dependencies. Modern scientific data sets are typically heterogeneous and often contain multiple known discrete subgroups of samples. For example, in genomics applications samples may be grouped according to tissue type or drug exposure. In the modeling process it is desirable to leverage the similarity among groups while accounting for differences between them. While a substantial literature exists for GPs over Euclidean domains $\mathbb{R}^p$, GPs on domains suitable for multigroup data remain less explored. Here, we develop a multigroup Gaussian process (MGGP), which we define on $\mathbb{R}^p\times \mathscr{C}$, where $\mathscr{C}$ is a finite set representing the group label. We provide general methods to construct valid (positive definite) covariance functions on this domain, and we describe algorithms for inference, estimation, and prediction. We perform simulation experiments and apply MGGP to gene expression data to illustrate the behavior and advantages of the MGGP in the joint modeling of continuous and categorical variables.
 [3] arXiv:2110.08425 [pdf, other]

Title: Exact Bias Correction for Linear Adjustment of Randomized Controlled Trials
Subjects: Methodology (stat.ME); Econometrics (econ.EM)
In an influential critique of empirical practice, Freedman (2008a, 2008b) showed that the linear regression estimator was biased for the analysis of randomized controlled trials under the randomization model. Under Freedman's assumptions, we derive exact closed-form bias corrections for the linear regression estimator with and without treatment-by-covariate interactions. We show that the limiting distribution of the bias-corrected estimator is identical to that of the uncorrected estimator, implying that the asymptotic gains from adjustment can be attained without introducing any risk of bias. Taken together with results from Lin (2013), our results show that Freedman's theoretical arguments against the use of regression adjustment can be completely resolved with minor modifications to practice.
 [4] arXiv:2110.08570 [pdf, other]

Title: A Reduced-Bias Weighted least square estimation of the Extreme Value Index
Comments: 24 pages
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)
In this paper, we propose a reduced-bias estimator of the extreme value index (EVI) for Pareto-type (heavy-tailed) distributions. It is derived using the weighted least squares method. It is shown that the estimator is unbiased, consistent and asymptotically normal under the second-order conditions on the underlying distribution of the data. The finite sample properties of the proposed estimator are studied through a simulation study. The results show that it is competitive with the existing estimators of the extreme value index in terms of bias and Mean Square Error. In addition, it yields estimates of $\gamma>0$ that are less sensitive to the number of top-order statistics, and hence can be used for selecting an optimal tail fraction. The proposed estimator is further illustrated using practical datasets from pedochemical and insurance applications.
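The abstract above does not spell out the weighted least squares estimator, so the sketch below does not attempt to reproduce it. As a point of reference only, it shows the classical Hill estimator of $\gamma>0$, a standard baseline whose dependence on the number of top-order statistics `k` is the sensitivity the abstract refers to. The function name and the simulated Pareto sample are illustrative assumptions.

```python
import math
import random


def hill_estimator(sample, k):
    """Classical Hill estimate of the extreme value index gamma > 0.

    Uses the k largest observations relative to the (k+1)-th largest.
    This is a textbook baseline, NOT the estimator proposed in the paper.
    """
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    x = sorted(sample, reverse=True)
    threshold = x[k]  # the (k+1)-th largest observation
    return sum(math.log(x[i] / threshold) for i in range(k)) / k


def pareto_sample(n, gamma, seed=0):
    """Draw n values with P(X > x) = x**(-1/gamma) for x >= 1 (illustrative data)."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], which avoids raising zero to a negative power.
    return [(1.0 - rng.random()) ** (-gamma) for _ in range(n)]
```

Evaluating `hill_estimator` over a range of `k` values on the same sample gives the usual picture of how strongly a tail-index estimate can depend on the chosen tail fraction.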
 [5] arXiv:2110.08665 [pdf, other]

Title: Quantile Regression by Dyadic CART
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
In this paper we propose and study a version of the Dyadic Classification and Regression Trees (DCART) estimator from Donoho (1997) for (fixed design) quantile regression in general dimensions. We refer to this proposed estimator as the QDCART estimator. Just like the mean regression version, we show that a) a fast dynamic programming based algorithm with computational complexity $O(N \log N)$ exists for computing the QDCART estimator and b) an oracle risk bound (trading off squared error and a complexity parameter of the true signal) holds for the QDCART estimator. This oracle risk bound then allows us to demonstrate that the QDCART estimator enjoys adaptively rate optimal estimation guarantees for piecewise constant and bounded variation function classes. In contrast to existing results for the DCART estimator, which require subgaussianity of the error distribution, our estimation guarantees do not need any restrictive tail decay assumptions on the error distribution. For instance, our results hold even when the error distribution has no first moment, such as the Cauchy distribution. Apart from the Dyadic CART method, we also consider other variant methods such as the Optimal Regression Tree (ORT) estimator introduced in Chatterjee and Goswami (2019). In particular, we also extend the ORT estimator to the quantile setting and establish that it enjoys analogous guarantees. Thus, this paper extends the scope of these globally optimal regression tree based methodologies to be applicable for heavy tailed data. We then perform extensive numerical experiments on both simulated and real data which illustrate the usefulness of the proposed methods.
 [6] arXiv:2110.08747 [pdf, ps, other]

Title: JEL ratio test for independence of time to failure and cause of failure in competing risks
Subjects: Methodology (stat.ME)
In the present article, we propose a jackknife empirical likelihood (JEL) ratio test for testing the independence of time to failure and cause of failure in competing risks data. We use U-statistic theory to derive the JEL ratio test. The asymptotic distribution of the test statistic is shown to be a chi-square distribution with one degree of freedom. A Monte Carlo simulation study is carried out to assess the finite sample behaviour of the proposed test. The performance of the proposed JEL test is compared with the test given in Dewan et al. (2004). Finally, we illustrate our test procedure using various real data sets.
 [7] arXiv:2110.08970 [pdf, other]

Title: Sample size calculations for n-of-1 trials
Subjects: Methodology (stat.ME); Applications (stat.AP)
N-of-1 trials, single participant trials in which multiple treatments are sequentially randomized over the study period, can give direct estimates of individual-specific treatment effects. Combining n-of-1 trials gives extra information for estimating the population average treatment effect compared with randomized controlled trials and increases precision for individual-specific treatment effect estimates. In this paper, we present a procedure for designing n-of-1 trials. We formally define the design components for determining the sample size of a series of n-of-1 trials, present models for analyzing these trials and use them to derive the sample size formula for estimating the population average treatment effect and the standard error of the individual-specific treatment effect estimates. We recommend first finding the possible designs that will satisfy the power requirement for estimating the population average treatment effect and then, if of interest, finalizing the design to also satisfy the standard error requirements for the individual-specific treatment effect estimates. The procedure is implemented and illustrated in the paper and through a Shiny app.
 [8] arXiv:2110.09040 [pdf, ps, other]

Title: A Bayesian approach to multi-task learning with network lasso
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Network lasso is a method for solving a multi-task learning problem through the regularized maximum likelihood method. A characteristic of network lasso is setting a different model for each sample. The relationships among the models are represented by relational coefficients. A crucial issue in network lasso is to provide appropriate values for these relational coefficients. In this paper, we propose a Bayesian approach to solve multi-task learning problems by network lasso. This approach allows us to objectively determine the relational coefficients by Bayesian estimation. The effectiveness of the proposed method is shown in a simulation study and a real data analysis.
 [9] arXiv:2110.09115 [pdf, other]

Title: Optimal designs for experiments for scalar-on-function linear models
Subjects: Methodology (stat.ME)
The aim of this work is to extend the usual optimal experimental design paradigm to experiments where the settings of one or more factors are functions. For these new experiments, a design consists of combinations of functions for each run of the experiment along with settings for non-functional variables. After briefly introducing the class of functional variables, basis function systems are described. Basis function expansion is applied to a functional linear model consisting of both functional and scalar factors, reducing the problem to an optimisation problem of a single design matrix.
 [10] arXiv:2110.09143 [pdf, other]

Title: Variance Reduction in Stochastic Reaction Networks using Control Variates
Comments: arXiv admin note: substantial text overlap with arXiv:1905.00854
Subjects: Methodology (stat.ME); Systems and Control (eess.SY); Molecular Networks (q-bio.MN); Quantitative Methods (q-bio.QM)
Monte Carlo estimation plays a crucial role in stochastic reaction networks. However, reducing the statistical uncertainty of the corresponding estimators requires sampling a large number of trajectories. We propose control variates based on the statistical moments of the process to reduce the estimators' variances. We develop an algorithm that selects an efficient subset of infinitely many control variates. To this end, the algorithm uses resampling and a redundancy-aware greedy selection. We demonstrate the efficiency of our approach in several case studies.
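For readers unfamiliar with control variates, the following is a minimal, generic sketch of the idea and not the moment-based construction or the selection algorithm of the paper. It assumes a toy target, $E[e^U]$ for $U$ uniform on $(0,1)$, and uses $U$ itself (known mean 0.5) as the single control variate. All function names are illustrative.

```python
import math
import random
import statistics


def control_variate_estimate(f_samples, g_samples, g_mean):
    """Estimate E[f] using g, whose mean g_mean is known, as a control variate."""
    n = len(f_samples)
    f_bar = statistics.fmean(f_samples)
    g_bar = statistics.fmean(g_samples)
    # Variance-minimising coefficient beta = Cov(f, g) / Var(g), estimated from the sample.
    cov_fg = sum((f - f_bar) * (g - g_bar)
                 for f, g in zip(f_samples, g_samples)) / (n - 1)
    beta = cov_fg / statistics.variance(g_samples)
    # Correct the plain mean by the observed error in the control variate's mean.
    return f_bar - beta * (g_bar - g_mean)


def demo(n=500, repeats=100, seed=0):
    """Return repeated plain and control-variate estimates of E[exp(U)], U ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    plain, adjusted = [], []
    for _ in range(repeats):
        u = [rng.random() for _ in range(n)]
        f = [math.exp(x) for x in u]
        plain.append(statistics.fmean(f))
        adjusted.append(control_variate_estimate(f, u, 0.5))
    return plain, adjusted
```

Comparing `statistics.variance(plain)` with `statistics.variance(adjusted)` shows the reduction. The gain depends on how strongly the control variate correlates with the quantity of interest, which is why choosing among many candidate control variates matters.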
 [11] arXiv:2110.09275 [pdf, ps, other]

Title: Double Robust Mass-Imputation with Matching Estimators
Authors: Ali Furkan Kalay
Subjects: Methodology (stat.ME)
This paper proposes using a method named Double Score Matching (DSM) to do mass-imputation and presents an application to make inferences with a non-probability sample. DSM is a $k$-Nearest Neighbors algorithm that uses two balance scores instead of covariates to reduce the dimension of the distance metric and thus to achieve a faster convergence rate. DSM mass-imputation and population inference are consistent if one of two balance score models is correctly specified. Simulation results show that the DSM performs better than recently developed double robust estimators when the data generating process has nonlinear confounders. The nonlinearity of the DGP is a major concern because it cannot be tested, and it leads to a violation of the assumptions required to achieve consistency. Even if the consistency of the DSM relies on the two modeling assumptions, it prevents bias from inflating under such cases because DSM is a semiparametric estimator. The confidence intervals are constructed using a wild bootstrapping approach. The proposed bootstrapping method generates valid confidence intervals as long as DSM is consistent.
 [12] arXiv:2110.09382 [pdf, other]

Title: Frequentist-Bayes Hybrid Covariance Estimation for Unfolding Problems
Authors: Pim Jordi Verschuuren
Subjects: Methodology (stat.ME); High Energy Physics - Experiment (hep-ex)
In this paper we present a frequentist-Bayesian hybrid method for estimating covariances of unfolded distributions using pseudo-experiments. The method is compared with other covariance estimation methods using the unbiased Rao-Cramer bound (RCB) and frequentist pseudo-experiments. We show that the unbiased RCB method diverges from the other two methods when regularization is introduced. The new hybrid method agrees well with the frequentist pseudo-experiment method for various amounts of regularization. However, the hybrid method has the added advantage of not requiring a clear likelihood definition and can be used in combination with any unfolding algorithm that uses a response matrix to model the detector response.
Cross-lists for Tue, 19 Oct 21
 [13] arXiv:2110.08331 (cross-list from cs.LG) [pdf, other]

Title: A New Approach for Interpretability and Reliability in Clinical Risk Prediction: Acute Coronary Syndrome Scenario
Authors: Francisco Valente, Jorge Henriques, Simão Paredes, Teresa Rocha, Paulo de Carvalho, João Morais
Comments: Accepted for publication in the Artificial Intelligence in Medicine journal. Abstract abridged to respect the arXiv's characters limit
Journal-ref: Artificial Intelligence in Medicine, Volume 117, 2021
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
We intend to create a new risk assessment methodology that combines the best characteristics of both risk score and machine learning models. More specifically, we aim to develop a method that, besides having a good performance, offers a personalized model and outcome for each patient, presents high interpretability, and incorporates an estimation of the prediction reliability which is not usually available. By combining these features in the same approach we expect that it can boost the confidence of physicians to use such a tool in their daily activity. In order to achieve the mentioned goals, a three-step methodology was developed: several rules were created by dichotomizing risk factors; such rules were trained with a machine learning classifier to predict the acceptance degree of each rule (the probability that the rule is correct) for each patient; that information was combined and used to compute the risk of mortality and the reliability of such prediction. The methodology was applied to a dataset of patients admitted with any type of acute coronary syndromes (ACS), to assess the 30-day all-cause mortality risk. The performance was compared with state-of-the-art approaches: logistic regression (LR), artificial neural network (ANN), and clinical risk score model (Global Registry of Acute Coronary Events - GRACE). The proposed approach achieved testing results identical to the standard LR, but offers superior interpretability and personalization; it also significantly outperforms the GRACE risk model and the standard ANN model. The calibration curve also suggests a very good generalization ability of the obtained model as it approaches the ideal curve. Finally, the reliability estimation of individual predictions presented a great correlation with the misclassification rate. Those properties may have a beneficial application in other clinical scenarios as well. [abridged]
 [14] arXiv:2110.08505 (cross-list from stat.ML) [pdf, other]

Title: Mode and Ridge Estimation in Euclidean and Directional Product Spaces: A Mean Shift Approach
Comments: 51 pages, 10 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
The set of local modes and the ridge lines estimated from a dataset are important summary characteristics of the data-generating distribution. In this work, we consider estimating the local modes and ridges from point cloud data in a product space with two or more Euclidean/directional metric spaces. Specifically, we generalize the well-known (subspace constrained) mean shift algorithm to the product space setting and illuminate some pitfalls in such generalization. We derive the algorithmic convergence of the proposed method, provide practical guidelines on the implementation, and demonstrate its effectiveness on both simulated and real datasets.
 [15] arXiv:2110.08884 (cross-list from stat.ML) [pdf, other]

Title: Persuasion by Dimension Reduction
Comments: arXiv admin note: text overlap with arXiv:2102.10909
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); General Economics (econ.GN); Statistics Theory (math.ST); Methodology (stat.ME)
How should an agent (the sender) observing multi-dimensional data (the state vector) persuade another agent to take the desired action? We show that it is always optimal for the sender to perform a (non-linear) dimension reduction by projecting the state vector onto a lower-dimensional object that we call the "optimal information manifold." We characterize geometric properties of this manifold and link them to the sender's preferences. Optimal policy splits information into "good" and "bad" components. When the sender's marginal utility is linear, revealing the full magnitude of good information is always optimal. In contrast, with concave marginal utility, optimal information design conceals the extreme realizations of good information and only reveals its direction (sign). We illustrate these effects by explicitly solving several multi-dimensional Bayesian persuasion problems.
 [16] arXiv:2110.08905 (cross-list from stat.AP) [pdf, other]

Title: Exploitation of error correlation in a large analysis validation: GlobCurrent case study
Authors: Richard E. Danielson, Johnny A. Johannessen, Graham D. Quartly, Marie-Hélène Rio, Bertrand Chapron, Fabrice Collard, Craig Donlon
Comments: 24 pages, 14 figures
Journal-ref: Remote Sens. Environ., 217, 476-490 (2018)
Subjects: Applications (stat.AP); Statistics Theory (math.ST); Methodology (stat.ME)
An assessment of variance in ocean current signal and noise shared by in situ observations (drifters) and a large gridded analysis (GlobCurrent) is sought as a function of day of the year for 1993-2015 and across a broad spectrum of current speed. Regardless of the division of collocations, it is difficult to claim that any synoptic assessment can be based on independent observations. Instead, a measurement model that departs from ordinary linear regression by accommodating error correlation is proposed. The interpretation of independence is explored by applying Fuller's (1987) concept of equation and measurement error to a division of error into shared (correlated) and unshared (uncorrelated) components, respectively. The resulting division of variance in the new model favours noise. Ocean current shared (equation) error is of comparable magnitude to unshared (measurement) error and the latter is, for GlobCurrent and drifters respectively, comparable to ordinary and reverse linear regression. Although signal variance appears to be small, its utility as a measure of agreement between two variates is highlighted.
Sparse collocations that sample a dense grid permit a first order autoregressive form of measurement model to be considered, including parameterizations of analysis-in situ error cross-correlation and analysis temporal error autocorrelation. The former (cross-correlation) is an equation error term that accommodates error shared by both GlobCurrent and drifters. The latter (autocorrelation) facilitates an identification and retrieval of all model parameters. Solutions are sought using a prescribed calibration between GlobCurrent and drifters (by variance matching). Because the true current variance of GlobCurrent and drifters is small, signal to noise ratio is near zero at best. This is particularly evident for moderate current speed and meridional current component.
 [17] arXiv:2110.08969 (cross-list from stat.AP) [pdf, ps, other]

Title: On completing a measurement model by symmetry
Authors: Richard E. Danielson
Comments: 4 pages
Subjects: Applications (stat.AP); Statistics Theory (math.ST); Methodology (stat.ME)
An appeal for symmetry is made to build established notions of specific representation and specific nonlinearity of measurement (often called model error) into a canonical linear regression model. Additive components are derived from the trivially complete model M = m. Factor analysis and equation error motivate corresponding notions of representation and nonlinearity in an errors-in-variables framework, with a novel interpretation of terms. It is suggested that a modern interpretation of correlation involves both linear and nonlinear association.
 [18] arXiv:2110.09192 (cross-list from cs.LG) [pdf, other]

Title: Learning Optimal Conformal Classifiers
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME); Machine Learning (stat.ML)
Modern deep learning based classifiers show very high accuracy on test data but this does not provide sufficient guarantees for safe deployment, especially in high-stakes AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's probability estimates to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training the model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. We show that ConfTr outperforms state-of-the-art CP methods for classification by reducing the average confidence set size (inefficiency). Moreover, it allows one to "shape" the confidence sets predicted at test time, which is difficult for standard CP. On experiments with several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.
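To make the "separate processing step after training" concrete, here is a minimal sketch of standard split conformal prediction for classification. It is a common textbook variant, not the ConfTr training procedure of the paper. It assumes a held-out calibration set of predicted class probabilities, and uses one minus the probability of the true class as the nonconformity score. The function names are illustrative.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Compute the score threshold from a held-out calibration set.

    cal_probs:  list of per-class probability lists, one per calibration example
    cal_labels: list of true class indices
    alpha:      target miscoverage level (confidence sets aim for 1 - alpha coverage)
    """
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: every class is included
    return scores[k - 1]


def prediction_set(probs, threshold):
    """Return all classes whose nonconformity score does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]
```

The average length of `prediction_set` over test inputs is the inefficiency mentioned in the abstract. Because the threshold is computed only after the classifier is fixed, the classifier itself never sees this quantity during training.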
Replacements for Tue, 19 Oct 21
 [19] arXiv:1911.09171 (replaced) [pdf, other]

Title: Re-Evaluating Strengthened-IV Designs: Asymptotic Efficiency, Bias Formula, and the Validity and Power of Sensitivity Analyses
Comments: 86 pages, 4 figures, 6 tables
Subjects: Methodology (stat.ME); Applications (stat.AP)
 [20] arXiv:2005.12556 (replaced) [pdf, other]

Title: Truncating the Exponential with a Uniform Distribution
Subjects: Methodology (stat.ME)
 [21] arXiv:2007.14190 (replaced) [pdf, other]

Title: Variable Selection for Doubly Robust Causal Inference
Subjects: Methodology (stat.ME)
 [22] arXiv:2012.11026 (replaced) [pdf]

Title: Independent Approximates enable closed-form parameter estimation of heavy-tailed distributions
Authors: Kenric P. Nelson
Comments: 30 pages, 8 figures, 7 tables
Subjects: Methodology (stat.ME); Information Theory (cs.IT); Data Analysis, Statistics and Probability (physics.data-an)
 [23] arXiv:2103.01621 (replaced) [pdf, other]

Title: Fast selection of nonlinear mixed effect models using penalized likelihood
Authors: Edouard Ollier
Subjects: Methodology (stat.ME); Computation (stat.CO)
 [24] arXiv:2104.07084 (replaced) [pdf, other]

Title: Grouped Variable Selection with Discrete Optimization: Computational and Statistical Perspectives
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
 [25] arXiv:2109.11307 (replaced) [pdf, other]

Title: Semiparametric bivariate extreme-value copulas
Authors: Javier Fernández Serrano
Comments: 23 pages, 22 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Computation (stat.CO)
 [26] arXiv:2110.04433 (replaced) [pdf, ps, other]

Title: Debiased Lasso for Generalized Linear Models with A Diverging Number of Covariates
Comments: arXiv admin note: text overlap with arXiv:2006.12778
Subjects: Methodology (stat.ME)
 [27] arXiv:2106.09769 (replaced) [pdf, other]

Title: Generalized regression operator estimation for continuous time functional data processes with missing at random response
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
 [28] arXiv:2106.10624 (replaced) [pdf]

Title: Combined tests based on restricted mean time lost for competing risks data
Comments: 26 pages, 3 figures
Journal-ref: Statistics in Biopharmaceutical Research, 2021
Subjects: Applications (stat.AP); Methodology (stat.ME)
 [29] arXiv:2110.01571 (replaced) [pdf, other]

Title: Learning Causal Representation for Face Transfer across Large Appearance Gap
Subjects: Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
 [30] arXiv:2110.01593 (replaced) [pdf, other]

Title: Generalized Kernel Thinning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)