New submissions for Fri, 23 Oct 20

[1]  arXiv:2010.11332 [pdf, other]
Title: Efficient Balanced Treatment Assignments for Experimentation
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)

In this work, we reframe the problem of balanced treatment assignment as optimization of a two-sample test between test and control units. Using this lens we provide an assignment algorithm that is optimal with respect to the minimum spanning tree test of Friedman and Rafsky (1979). This assignment to treatment groups may be performed exactly in polynomial time. We provide a probabilistic interpretation of this process in terms of the most probable element of designs drawn from a determinantal point process. We provide a novel formulation of estimation as transductive inference and show how the tree structures used in design can also be used in an adjustment estimator. We conclude with a simulation study demonstrating the improved efficacy of our method.
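
As a rough illustration of the Friedman-Rafsky statistic mentioned in this abstract, the sketch below builds a Euclidean minimum spanning tree over the pooled units and counts the edges that join units from different groups. It is not the authors' assignment algorithm; the function names `mst_edges` and `cross_group_edges` and the use of Euclidean distance are assumptions made for the example.

```python
import math

def mst_edges(points):
    """Edges of a Euclidean minimum spanning tree (Prim's algorithm, O(n^2))."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from each point to the tree
    parent = [-1] * n       # tree vertex that attains that distance
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Update the distances of the remaining points.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges

def cross_group_edges(points, labels):
    """Number of MST edges whose endpoints carry different group labels."""
    return sum(1 for u, v in mst_edges(points) if labels[u] != labels[v])
```

In the two-sample test, few cross-group edges suggest that the groups occupy different regions of covariate space, while many cross-group edges suggest that the groups are well mixed.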

[2]  arXiv:2010.11368 [pdf, ps, other]
Title: Robust estimation in beta regression via maximum Lq-likelihood
Subjects: Methodology (stat.ME)

Beta regression models are widely used for modeling continuous data limited to the unit interval, such as proportions, fractions, and rates. The inference for the parameters of beta regression models is commonly based on maximum likelihood estimation. However, it is known to be sensitive to discrepant observations. In some cases, one atypical data point can lead to severe bias and erroneous conclusions about the features of interest. In this work, we develop a robust estimation procedure for beta regression models based on the maximization of a reparameterized Lq-likelihood. The new estimator offers a trade-off between robustness and efficiency through a tuning constant. To select the optimal value of the tuning constant, we propose a data-driven method which ensures full efficiency in the absence of outliers. We also improve on an alternative robust estimator by applying our data-driven method to select its optimum tuning constant. Monte Carlo simulations suggest marked robustness of the two robust estimators with little loss of efficiency. Applications to three datasets are presented and discussed. As a by-product of the proposed methodology, residual diagnostic plots based on robust fits highlight outliers that would be masked under maximum likelihood estimation.
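
For orientation, the generic Lq transform that underlies maximum Lq-likelihood can be sketched as follows. This is the standard form, which reduces to the logarithm as q approaches 1; the reparameterization and the data-driven choice of the tuning constant described in the abstract are not reproduced here, and the helper names `lq` and `lq_weight` are hypothetical.

```python
import math

def lq(u, q):
    """Lq transform: (u**(1-q) - 1) / (1-q) for q != 1, and log(u) at q == 1."""
    if u <= 0:
        raise ValueError("u must be positive")
    if q == 1.0:
        return math.log(u)
    return (u ** (1.0 - q) - 1.0) / (1.0 - q)

def lq_weight(density, q):
    """Weight density**(1-q) multiplying an observation's score contribution.

    It follows from d/dtheta lq(f) = f**(1-q) * d/dtheta log(f).
    For q < 1, observations with low fitted density are downweighted.
    """
    return density ** (1.0 - q)
```

With q slightly below 1, an observation that the fitted model considers very unlikely contributes less to the estimating equation than it would under ordinary maximum likelihood, which is the source of the robustness.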

[3]  arXiv:2010.11385 [pdf, ps, other]
Title: A Normal-Gamma Dirichlet Process Mixture Model
Subjects: Methodology (stat.ME); Applications (stat.AP)

We propose a Dirichlet process mixture (DPM) for prediction and cluster-wise variable selection, based on a Normal-Gamma baseline distribution on the linear regression coefficients, and which gives rise to strong posterior consistency. A simulation study and real data application showed that in terms of predictive and variable selection accuracy, the model tended to outperform the standard DPM model assigned a normal prior with no variable selection. Software code is provided in the Supplementary Information.

[4]  arXiv:2010.11449 [pdf, other]
Title: PLSO: A generative framework for decomposing nonstationary timeseries into piecewise stationary oscillatory components
Subjects: Methodology (stat.ME); Applications (stat.AP)

To capture the slowly time-varying spectral content of real-world time series, a common paradigm is to partition the data into approximately stationary intervals and perform inference in the time-frequency domain. This approach, however, lacks a corresponding nonstationary time-domain generative model for the entire data and thus, time-domain inference, such as sampling from the posterior, occurs in each interval separately. This results in distortion/discontinuity around interval boundaries and, consequently, can lead to erroneous inferences based on any quantities derived from the posterior, such as the phase. To address these shortcomings, we propose the Piecewise Locally Stationary Oscillation (PLSO) generative model for decomposing time-series data with slowly time-varying spectra into several oscillatory, piecewise-stationary processes. PLSO, being a nonstationary time-domain generative model, enables inference on the entire time-series, without boundary effects, and, at the same time, provides a characterization of its time-varying spectral properties. We propose a novel two-stage inference algorithm that combines the classical Kalman filter and the recently-proposed accelerated proximal gradient algorithm to optimize the nonconvex Whittle likelihood from PLSO. We demonstrate these points through experiments on simulated data and real neural data from the rat and the human brain.

[5]  arXiv:2010.11783 [pdf, other]
Title: Efficient Bayesian inference of fully stochastic epidemiological models with applications to COVID-19
Comments: 18 pages
Subjects: Methodology (stat.ME); Populations and Evolution (q-bio.PE)

Epidemiological forecasts are beset by uncertainties in the generative model for the disease, and the surveillance process through which data are acquired. We present a Bayesian inference methodology that quantifies these uncertainties, for epidemics that are modelled by (possibly) non-stationary, continuous-time, Markov population processes. The efficiency of the method derives from a functional central limit theorem approximation of the likelihood, valid for large populations. We demonstrate the methodology by analysing the early stages of the COVID-19 pandemic in the UK, based on age-structured data for the number of deaths. This includes maximum a posteriori estimates, MCMC sampling of the posterior, computation of the model evidence, and the determination of parameter sensitivities via the Fisher information matrix. Our methodology is implemented in PyRoss, an open-source platform for analysis of epidemiological compartment models.

[6]  arXiv:2010.11850 [pdf, other]
Title: Efficient design of geographically-defined clusters with spatial autocorrelation
Authors: Samuel I. Watson
Subjects: Methodology (stat.ME)

Clusters form the basis of a number of research study designs including survey and experimental studies. Cluster-based designs can be less costly but also less efficient than individual-based designs due to correlation between individuals within the same cluster. Their design typically relies on ad hoc choices of correlation parameters, and is insensitive to variations in cluster design. This article examines how to efficiently design clusters where they are geographically defined by demarcating areas incorporating individuals and households or other units. Using geostatistical models for spatial autocorrelation we generate approximations to within-cluster average covariance in order to estimate the effective sample size given particular cluster design parameters. We show how the number of enumerated locations, cluster area, proportion sampled, and sampling method affect the efficiency of the design and consider the optimization problem of choosing the most efficient design subject to budgetary constraints. We also consider how the parameters from these approximations can be interpreted simply in terms of 'real-world' quantities and used in design analysis.

Cross-lists for Fri, 23 Oct 20

[7]  arXiv:2010.11330 (cross-list from stat.AP) [pdf, other]
Title: Integrated causal-predictive machine learning models for tropical cyclone epidemiology
Subjects: Applications (stat.AP); Methodology (stat.ME)

Strategic preparedness has been shown to reduce the adverse health impacts of hurricanes and tropical storms, referred to collectively as tropical cyclones (TCs), but its protective impact could be enhanced by a more comprehensive and rigorous characterization of TC epidemiology. To generate the insights and tools necessary for high-precision TC preparedness, we develop and apply a novel Bayesian machine learning approach that standardizes estimation of historic TC health impacts, discovers common patterns and sources of heterogeneity in those health impacts, and enables identification of communities at highest health risk for future TCs. The model integrates (1) a causal inference component to quantify the immediate health impacts of recent historic TCs at high spatial resolution and (2) a predictive component that captures how TC meteorological features and socioeconomic/demographic characteristics of impacted communities are associated with health impacts. We apply it to a rich data platform containing detailed historic TC exposure information and Medicare claims data. The health outcomes used in our analyses are all-cause mortality and cardiovascular- and respiratory-related hospitalizations. We report a high degree of heterogeneity in the acute health impacts of historic TCs at both the TC level and the community level, with substantial increases in respiratory hospitalizations, on average, during a two-week period surrounding TCs. TC sustained windspeeds are found to be the primary driver of increased mortality and respiratory risk. Our modeling approach has broader utility for predicting the health impacts of many types of extreme climate events.

[8]  arXiv:2010.11367 (cross-list from cs.SI) [pdf, other]
Title: TeX-Graph: Coupled tensor-matrix knowledge-graph embedding for COVID-19 drug repurposing
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Signal Processing (eess.SP); Methodology (stat.ME)

Knowledge graphs (KGs) are powerful tools that codify relational behaviour between entities in knowledge bases. KGs can simultaneously model many different types of subject-predicate-object and higher-order relations. As such, they offer a flexible modeling framework that has been applied to many areas, including biology and pharmacology -- most recently, in the fight against COVID-19. The flexibility of KG modeling is both a blessing and a challenge from the learning point of view. In this paper we propose a novel coupled tensor-matrix framework for KG embedding. We leverage tensor factorization tools to learn concise representations of entities and relations in knowledge bases and employ these representations to perform drug repurposing for COVID-19. Our proposed framework is principled, elegant, and achieves 100% improvement over the best baseline in the COVID-19 drug repurposing task using a recently developed biological KG.

[9]  arXiv:2010.11417 (cross-list from math.ST) [pdf, ps, other]
Title: Positive definiteness of the asymptotic covariance matrix of OLS estimators in parsimonious regressions
Authors: Daisuke Nagakura
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)

Recently, Ghysels, Hill, and Motegi (2020) have proposed a test for a large set of zero restrictions on coefficients in regression models. They referred to the test as a max test. The test statistic for the max test is calculated by first running OLS regressions, each of which includes only one of the explanatory variables whose coefficients are under examination, and then taking the maximum value of the squared OLS estimates of those coefficients. They called those regressions parsimonious regressions. In this paper, we answer a question raised in Remark 2.4 in Ghysels, Hill, and Motegi (2020), namely, whether the asymptotic covariance matrix of the OLS estimators in the parsimonious regressions is, in general, positive definite. We show that it is generally positive definite. The result may be utilized to facilitate the calculation of the simulated p-values necessary for implementing the max test.
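
The construction described above can be sketched in a simplified form: regress the response on each candidate variable separately and take the largest squared slope. The sketch assumes each parsimonious regression contains only an intercept and one regressor, and it omits any sample-size scaling, weighting, or common regressors that the original test may use; `ols_slope` and `max_statistic` are hypothetical names.

```python
def ols_slope(x, y):
    """Slope from an OLS regression of y on an intercept and a single regressor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def max_statistic(columns, y):
    """Largest squared slope across the one-regressor-at-a-time regressions."""
    return max(ols_slope(col, y) ** 2 for col in columns)
```

For example, with `columns = [[1, 2, 3, 4], [1, -1, -1, 1]]` and `y = [2, 4, 6, 8]`, the first regression has slope 2 and the second has slope 0, so the unscaled statistic is 4.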

[10]  arXiv:2010.11470 (cross-list from math.ST) [pdf, ps, other]
Title: Optimal Change-Point Detection and Localization
Comments: 73 pages
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)

Given a time series ${\bf Y}$ in $\mathbb{R}^n$, with a piecewise constant mean and independent components, the twin problems of change-point detection and change-point localization respectively amount to detecting the existence of times where the mean varies and estimating the positions of those change-points. In this work, we tightly characterize optimal rates for both problems and uncover the phase transition phenomenon from a global testing problem to a local estimation problem. Introducing a suitable definition of the energy of a change-point, we first establish in the single change-point setting that the optimal detection threshold is $\sqrt{2\log\log(n)}$. When the energy is just above the detection threshold, then the problem of localizing the change-point becomes purely parametric: it only depends on the difference in means and not on the position of the change-point anymore. Interestingly, for most change-point positions, it is possible to detect and localize them at a much smaller energy level. In the multiple change-point setting, we establish the energy detection threshold and show similarly that the optimal localization error of a specific change-point becomes purely parametric. Along the way, tight optimal rates for Hausdorff and $l_1$ estimation losses of the vector of all change-point positions are also established. Two procedures achieving these optimal rates are introduced. The first one is a least-squares estimator with a new multiscale penalty that favours well-spread change-points. The second one is a two-step multiscale post-processing procedure whose computational complexity can be as low as $O(n\log(n))$. Notably, these two procedures accommodate the presence of possibly many low-energy and therefore undetectable change-points and are still able to detect and localize high-energy change-points even in the presence of those nuisance parameters.
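
To make the single change-point setting concrete, the sketch below computes the classical CUSUM scan for a sequence with unit-variance noise and evaluates the $\sqrt{2\log\log(n)}$ rate quoted in the abstract. It is not either of the two procedures introduced in the paper, and the abstract does not state how the energy of a change-point is defined, so comparing the CUSUM value with this rate is illustrative only. The names `cusum_scan` and `detection_rate` are hypothetical.

```python
import math

def cusum_scan(y):
    """Return (t, stat) maximizing sqrt(t(n-t)/n) * |mean(y[:t]) - mean(y[t:])|."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(y)
    left = 0.0
    best_t, best_stat = None, -1.0
    for t in range(1, n):
        left += y[t - 1]            # running sum of the first t observations
        right = total - left        # sum of the remaining n - t observations
        stat = math.sqrt(t * (n - t) / n) * abs(left / t - right / (n - t))
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat

def detection_rate(n):
    """The value sqrt(2 log log n) from the abstract (requires n >= 3)."""
    return math.sqrt(2.0 * math.log(math.log(n)))
```

The scan runs in linear time because the left and right sums are updated incrementally rather than recomputed for every candidate position.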

[11]  arXiv:2010.11665 (cross-list from stat.ML) [pdf, ps, other]
Title: Spike and slab variational Bayes for high dimensional logistic regression
Comments: NeurIPS 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)

Variational Bayes (VB) is a popular scalable alternative to Markov chain Monte Carlo for Bayesian inference. We study a mean-field spike and slab VB approximation of widely used Bayesian model selection priors in sparse high-dimensional logistic regression. We provide non-asymptotic theoretical guarantees for the VB posterior in both $\ell_2$ and prediction loss for a sparse truth, giving optimal (minimax) convergence rates. Since the VB algorithm does not depend on the unknown truth to achieve optimality, our results shed light on effective prior choices. We confirm the improved performance of our VB algorithm over common sparse VB approaches in a numerical study.

[12]  arXiv:2010.11826 (cross-list from stat.AP) [pdf, other]
Title: Nonparametric robust monitoring of time series panel data
Comments: 58 pages, 18 figures
Subjects: Applications (stat.AP); Methodology (stat.ME)

In many applications, a control procedure is required to detect potential deviations in a panel of serially correlated processes. It is common that the processes are corrupted by noise and that no prior information about the in-control data is available for that purpose. This paper suggests a general nonparametric monitoring scheme for supervising such a panel with time-varying mean and variance. The method is based on a control chart designed by block bootstrap, which does not require parametric assumptions on the distribution of the data. The procedure is tailored to cope with strong noise, potentially missing values and absence of in-control series, which is tackled by an intelligent exploitation of the information in the panel. Our methodology is completed by support vector machine procedures to estimate magnitude and form of the encountered deviations (such as stepwise shifts or functional drifts). This scheme, though generic in nature, is able to treat an important applied data problem: the control of deviations in a subset of sunspot number observations which are part of the International Sunspot Number, a world reference for long-term solar activity.

Replacements for Fri, 23 Oct 20

