Methodology
New submissions
New submissions for Fri, 23 Oct 20
 [1] arXiv:2010.11332 [pdf, other]

Title: Efficient Balanced Treatment Assignments for Experimentation
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
In this work, we reframe the problem of balanced treatment assignment as optimization of a two-sample test between test and control units. Using this lens we provide an assignment algorithm that is optimal with respect to the minimum spanning tree test of Friedman and Rafsky (1979). This assignment to treatment groups may be performed exactly in polynomial time. We provide a probabilistic interpretation of this process in terms of the most probable element of designs drawn from a determinantal point process which admits a probabilistic interpretation of the design. We provide a novel formulation of estimation as transductive inference and show how the tree structures used in design can also be used in an adjustment estimator. We conclude with a simulation study demonstrating the improved efficacy of our method.
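The minimum spanning tree idea mentioned in this abstract can be sketched as follows. This is only an illustration and is not taken from the paper: it builds a Euclidean minimum spanning tree with Prim's algorithm, counts the tree edges that join units from different groups (the quantity underlying the Friedman and Rafsky test), and two-colors the tree so that every edge joins a treated unit to a control unit. The function names are hypothetical, and the two-coloring does not by itself guarantee equal group sizes.

```python
import math
from collections import deque


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns a list of (i, j) edges."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection of each node to the tree
    best_from = [-1] * n         # tree node that gives that cheapest connection
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def cross_group_edges(edges, labels):
    """Number of tree edges whose endpoints carry different group labels."""
    return sum(1 for i, j in edges if labels[i] != labels[j])


def two_color_tree(n, edges):
    """Alternate labels 0/1 along the tree so that every edge joins different groups."""
    adjacency = [[] for _ in range(n)]
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    labels = [None] * n
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if labels[v] is None:
                    labels[v] = 1 - labels[u]
                    queue.append(v)
    return labels
```

A tree is always two-colorable, so the labels returned by `two_color_tree` make every tree edge a cross-group edge, which is the configuration in which the two groups look most alike to an edge-counting test.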
 [2] arXiv:2010.11368 [pdf, ps, other]

Title: Robust estimation in beta regression via maximum Lq-likelihood
Subjects: Methodology (stat.ME)
Beta regression models are widely used for modeling continuous data limited to the unit interval, such as proportions, fractions, and rates. The inference for the parameters of beta regression models is commonly based on maximum likelihood estimation. However, it is known to be sensitive to discrepant observations. In some cases, one atypical data point can lead to severe bias and erroneous conclusions about the features of interest. In this work, we develop a robust estimation procedure for beta regression models based on the maximization of a reparameterized Lq-likelihood. The new estimator offers a tradeoff between robustness and efficiency through a tuning constant. To select the optimal value of the tuning constant, we propose a data-driven method which ensures full efficiency in the absence of outliers. We also improve on an alternative robust estimator by applying our data-driven method to select its optimum tuning constant. Monte Carlo simulations suggest marked robustness of the two robust estimators with little loss of efficiency. Applications to three datasets are presented and discussed. As a byproduct of the proposed methodology, residual diagnostic plots based on robust fits highlight outliers that would be masked under maximum likelihood estimation.
 [3] arXiv:2010.11385 [pdf, ps, other]

Title: A Normal-Gamma Dirichlet Process Mixture Model
Subjects: Methodology (stat.ME); Applications (stat.AP)
We propose a Dirichlet process mixture (DPM) for prediction and cluster-wise variable selection, based on a Normal-Gamma baseline distribution on the linear regression coefficients, and which gives rise to strong posterior consistency. A simulation study and real data application showed that in terms of predictive and variable selection accuracy, the model tended to outperform the standard DPM model assigned a normal prior with no variable selection. Software code is provided in the Supplementary Information.
 [4] arXiv:2010.11449 [pdf, other]

Title: PLSO: A generative framework for decomposing nonstationary time-series into piecewise stationary oscillatory components
Subjects: Methodology (stat.ME); Applications (stat.AP)
To capture the slowly time-varying spectral content of real-world time series, a common paradigm is to partition the data into approximately stationary intervals and perform inference in the time-frequency domain. This approach, however, lacks a corresponding nonstationary time-domain generative model for the entire data and thus, time-domain inference, such as sampling from the posterior, occurs in each interval separately. This results in distortion/discontinuity around interval boundaries and, consequently, can lead to erroneous inferences based on any quantities derived from the posterior, such as the phase. To address these shortcomings, we propose the Piecewise Locally Stationary Oscillation (PLSO) generative model for decomposing time-series data with slowly time-varying spectra into several oscillatory, piecewise-stationary processes. PLSO, being a nonstationary time-domain generative model, enables inference on the entire time-series, without boundary effects, and, at the same time, provides a characterization of its time-varying spectral properties. We propose a novel two-stage inference algorithm that combines the classical Kalman filter and the recently proposed accelerated proximal gradient algorithm to optimize the nonconvex Whittle likelihood from PLSO. We demonstrate these points through experiments on simulated data and real neural data from the rat and the human brain.
 [5] arXiv:2010.11783 [pdf, other]

Title: Efficient Bayesian inference of fully stochastic epidemiological models with applications to COVID-19
Authors: Yuting I. Li, Günther Turk, Paul B. Rohrbach, Patrick Pietzonka, Julian Kappler, Rajesh Singh, Jakub Dolezal, Timothy Ekeh, Lukas Kikuchi, Joseph D. Peterson, Hideki Kobayashi, Michael E. Cates, R. Adhikari, Robert L. Jack
Comments: 18 pages
Subjects: Methodology (stat.ME); Populations and Evolution (q-bio.PE)
Epidemiological forecasts are beset by uncertainties in the generative model for the disease, and the surveillance process through which data are acquired. We present a Bayesian inference methodology that quantifies these uncertainties, for epidemics that are modelled by (possibly) nonstationary, continuous-time, Markov population processes. The efficiency of the method derives from a functional central limit theorem approximation of the likelihood, valid for large populations. We demonstrate the methodology by analysing the early stages of the COVID-19 pandemic in the UK, based on age-structured data for the number of deaths. This includes maximum a posteriori estimates, MCMC sampling of the posterior, computation of the model evidence, and the determination of parameter sensitivities via the Fisher information matrix. Our methodology is implemented in PyRoss, an open-source platform for analysis of epidemiological compartment models.
 [6] arXiv:2010.11850 [pdf, other]

Title: Efficient design of geographically-defined clusters with spatial autocorrelation
Authors: Samuel I. Watson
Subjects: Methodology (stat.ME)
Clusters form the basis of a number of research study designs including survey and experimental studies. Cluster-based designs can be less costly but also less efficient than individual-based designs due to correlation between individuals within the same cluster. Their design typically relies on ad hoc choices of correlation parameters, and is insensitive to variations in cluster design. This article examines how to efficiently design clusters where they are geographically defined by demarcating areas incorporating individuals and households or other units. Using geostatistical models for spatial autocorrelation we generate approximations to within-cluster average covariance in order to estimate the effective sample size given particular cluster design parameters. We show how the number of enumerated locations, cluster area, proportion sampled, and sampling method affect the efficiency of the design and consider the optimization problem of choosing the most efficient design subject to budgetary constraints. We also consider how the parameters from these approximations can be interpreted simply in terms of 'real-world' quantities and used in design analysis.
Cross-lists for Fri, 23 Oct 20
 [7] arXiv:2010.11330 (cross-list from stat.AP) [pdf, other]

Title: Integrated causal-predictive machine learning models for tropical cyclone epidemiology
Authors: Rachel C. Nethery, Nina Katz-Christy, Marianthi-Anna Kioumourtzoglou, Robbie M. Parks, Andrea Schumacher, G. Brooke Anderson
Subjects: Applications (stat.AP); Methodology (stat.ME)
Strategic preparedness has been shown to reduce the adverse health impacts of hurricanes and tropical storms, referred to collectively as tropical cyclones (TCs), but its protective impact could be enhanced by a more comprehensive and rigorous characterization of TC epidemiology. To generate the insights and tools necessary for high-precision TC preparedness, we develop and apply a novel Bayesian machine learning approach that standardizes estimation of historic TC health impacts, discovers common patterns and sources of heterogeneity in those health impacts, and enables identification of communities at highest health risk for future TCs. The model integrates (1) a causal inference component to quantify the immediate health impacts of recent historic TCs at high spatial resolution and (2) a predictive component that captures how TC meteorological features and socioeconomic/demographic characteristics of impacted communities are associated with health impacts. We apply it to a rich data platform containing detailed historic TC exposure information and Medicare claims data. The health outcomes used in our analyses are all-cause mortality and cardiovascular and respiratory-related hospitalizations. We report a high degree of heterogeneity in the acute health impacts of historic TCs at both the TC level and the community level, with substantial increases in respiratory hospitalizations, on average, during a two-week period surrounding TCs. TC sustained windspeeds are found to be the primary driver of increased mortality and respiratory risk. Our modeling approach has broader utility for predicting the health impacts of many types of extreme climate events.
 [8] arXiv:2010.11367 (cross-list from cs.SI) [pdf, other]

Title: TeX-Graph: Coupled tensor-matrix knowledge-graph embedding for COVID-19 drug repurposing
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Signal Processing (eess.SP); Methodology (stat.ME)
Knowledge graphs (KGs) are powerful tools that codify relational behaviour between entities in knowledge bases. KGs can simultaneously model many different types of subject-predicate-object and higher-order relations. As such, they offer a flexible modeling framework that has been applied to many areas, including biology and pharmacology, and most recently in the fight against COVID-19. The flexibility of KG modeling is both a blessing and a challenge from the learning point of view. In this paper we propose a novel coupled tensor-matrix framework for KG embedding. We leverage tensor factorization tools to learn concise representations of entities and relations in knowledge bases and employ these representations to perform drug repurposing for COVID-19. Our proposed framework is principled, elegant, and achieves 100% improvement over the best baseline in the COVID-19 drug repurposing task using a recently developed biological KG.
 [9] arXiv:2010.11417 (cross-list from math.ST) [pdf, ps, other]

Title: Positive definiteness of the asymptotic covariance matrix of OLS estimators in parsimonious regressions
Authors: Daisuke Nagakura
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
Recently, Ghysels, Hill, and Motegi (2020) have proposed a test for a large set of zero restrictions on coefficients in regression models. They referred to the test as a max test. The test statistic for the max test is calculated by first running OLS regressions, each of which includes only one of the explanatory variables whose coefficients are under examination, and then taking the maximum value of the squared OLS estimates of those coefficients. They called those regressions parsimonious regressions. In this paper, we answer a question raised in Remark 2.4 in Ghysels, Hill, and Motegi (2020), namely, whether the asymptotic covariance matrix of the OLS estimators in the parsimonious regressions is, in general, positive definite. We show that it is generally positive definite. The result may be utilized to facilitate the calculation of the simulated p-values necessary for implementing the max test.
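The two steps described in this abstract (one small OLS regression per variable under examination, then the maximum of the squared estimates) can be sketched as below. This is a minimal illustration under stated assumptions, not the statistic as defined by Ghysels, Hill, and Motegi: each parsimonious regression here contains only an intercept and a single variable, and any scaling, weighting, or additional common regressors used in the actual test are omitted. The function names are hypothetical.

```python
def simple_ols_slope(x, y):
    """OLS slope from regressing y on an intercept and the single regressor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def max_squared_slope(columns, y):
    """Run one parsimonious regression per column and return the largest squared slope.

    Assumption: no scaling or weighting is applied to the squared estimates.
    """
    return max(simple_ols_slope(column, y) ** 2 for column in columns)


# Usage: two candidate regressors, only the first of which drives y.
x0 = [1.0, -1.0, 1.0, -1.0]
x1 = [1.0, 1.0, -1.0, -1.0]
y = [3.0 * value for value in x0]
statistic = max_squared_slope([x0, x1], y)
```

Simulated p-values for the test would additionally require the joint asymptotic distribution of the slope estimates, which is where the covariance matrix studied in the paper enters; that step is not shown here.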
 [10] arXiv:2010.11470 (cross-list from math.ST) [pdf, ps, other]

Title: Optimal Change-Point Detection and Localization
Comments: 73 pages
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
Given a time series ${\bf Y}$ in $\mathbb{R}^n$, with a piecewise constant mean and independent components, the twin problems of change-point detection and change-point localization respectively amount to detecting the existence of times where the mean varies and estimating the positions of those change-points. In this work, we tightly characterize optimal rates for both problems and uncover the phase transition phenomenon from a global testing problem to a local estimation problem. Introducing a suitable definition of the energy of a change-point, we first establish in the single change-point setting that the optimal detection threshold is $\sqrt{2\log\log(n)}$. When the energy is just above the detection threshold, then the problem of localizing the change-point becomes purely parametric: it only depends on the difference in means and not on the position of the change-point anymore. Interestingly, for most change-point positions, it is possible to detect and localize them at a much smaller energy level. In the multiple change-point setting, we establish the energy detection threshold and show similarly that the optimal localization error of a specific change-point becomes purely parametric. Along the way, tight optimal rates for Hausdorff and $l_1$ estimation losses of the vector of all change-point positions are also established. Two procedures achieving these optimal rates are introduced. The first one is a least-squares estimator with a new multiscale penalty that favours well-spread change-points. The second one is a two-step multiscale post-processing procedure whose computational complexity can be as low as $O(n\log(n))$. Notably, these two procedures accommodate the presence of possibly many low-energy and therefore undetectable change-points and are still able to detect and localize high-energy change-points even in the presence of those nuisance parameters.
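To give a feel for how slowly the threshold $\sqrt{2\log\log(n)}$ quoted in this abstract grows with the series length, the formula can be evaluated directly. The sketch below only computes that expression; it assumes natural logarithms and does not implement the paper's definition of change-point energy or either of its procedures. The function name is hypothetical.

```python
import math


def detection_threshold(n):
    """Evaluate sqrt(2 * log(log(n))) with natural logarithms (an assumption)."""
    if n <= math.e:
        # log(log(n)) is not positive for n <= e, so the expression is undefined here.
        raise ValueError("n must be larger than e")
    return math.sqrt(2.0 * math.log(math.log(n)))


# Usage: tabulate the threshold for a few series lengths.
for length in (10**3, 10**6, 10**9, 10**12):
    print(length, round(detection_threshold(length), 3))
```

Because of the iterated logarithm, multiplying the series length by a million changes the threshold only modestly.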
 [11] arXiv:2010.11665 (cross-list from stat.ML) [pdf, ps, other]

Title: Spike and slab variational Bayes for high dimensional logistic regression
Comments: NeurIPS 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Variational Bayes (VB) is a popular scalable alternative to Markov chain Monte Carlo for Bayesian inference. We study a mean-field spike and slab VB approximation of widely used Bayesian model selection priors in sparse high-dimensional logistic regression. We provide non-asymptotic theoretical guarantees for the VB posterior in both $\ell_2$ and prediction loss for a sparse truth, giving optimal (minimax) convergence rates. Since the VB algorithm does not depend on the unknown truth to achieve optimality, our results shed light on effective prior choices. We confirm the improved performance of our VB algorithm over common sparse VB approaches in a numerical study.
 [12] arXiv:2010.11826 (cross-list from stat.AP) [pdf, other]

Title: Nonparametric robust monitoring of time series panel data
Comments: 58 pages, 18 figures
Subjects: Applications (stat.AP); Methodology (stat.ME)
In many applications, a control procedure is required to detect potential deviations in a panel of serially correlated processes. It is common that the processes are corrupted by noise and that no prior information about the in-control data is available for that purpose. This paper suggests a general nonparametric monitoring scheme for supervising such a panel with time-varying mean and variance. The method is based on a control chart designed by block bootstrap, which does not require parametric assumptions on the distribution of the data. The procedure is tailored to cope with strong noise, potentially missing values and absence of in-control series, which is tackled by an intelligent exploitation of the information in the panel. Our methodology is completed by support vector machine procedures to estimate magnitude and form of the encountered deviations (such as stepwise shifts or functional drifts). This scheme, though generic in nature, is able to treat an important applied data problem: the control of deviations in a subset of sunspot number observations which are part of the International Sunspot Number, a world reference for long-term solar activity.
Replacements for Fri, 23 Oct 20
 [13] arXiv:1911.06743 (replaced) [pdf, other]

Title: Scalable and Accurate Variational Bayes for High-Dimensional Binary Regression Models
Subjects: Methodology (stat.ME); Computation (stat.CO)
 [14] arXiv:2005.08543 (replaced) [pdf, other]

Title: Necessary and sufficient conditions for causal feature selection in time series with latent common causes
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
 [15] arXiv:2001.01890 (replaced) [pdf, other]

Title: Statistical Inference for High-Dimensional Matrix-Variate Factor Model
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
 [16] arXiv:2006.07571 (replaced) [pdf, other]

Title: $γ$-ABC: Outlier-Robust Approximate Bayesian Computation Based on a Robust Divergence Estimator
Comments: 46 pages, 15 figures, adding the experimental results of simulation errors
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
 [17] arXiv:2010.09921 (replaced) [pdf, other]

Title: Sufficient dimension reduction for classification using principal optimal transport direction
Comments: 18 pages, 4 figures, to be published in 34th Conference on Neural Information Processing Systems (NeurIPS 2020), add the supplementary material
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)