Applications

New submissions

[ total of 24 entries: 1-24 ]

New submissions for Tue, 19 Oct 21

[1]  arXiv:2110.08363 [pdf, other]
Title: Spatio-temporal extreme event modeling of terror insurgencies
Subjects: Applications (stat.AP); Machine Learning (stat.ML)

Extreme events with potentially deadly outcomes, such as those organized by terror groups, are highly unpredictable in nature and an imminent threat to society. In particular, quantifying the likelihood of a terror attack occurring in an arbitrary space-time region and its relative societal risk would facilitate informed measures that would strengthen national security. This paper introduces a novel self-exciting marked spatio-temporal model for attacks whose inhomogeneous baseline intensity is written as a function of covariates. Its triggering intensity is succinctly modeled with a Gaussian Process prior distribution to flexibly capture intricate spatio-temporal dependencies between an arbitrary attack and previous terror events. By inferring the parameters of this model, we highlight specific space-time areas in which attacks are likely to occur. Furthermore, by measuring the outcome of an attack in terms of the number of casualties it produces, we introduce a novel mixture distribution for the number of casualties. This distribution flexibly handles low and high numbers of casualties and the discrete nature of the data through a Generalized ZipF distribution. We rely on a customized Markov chain Monte Carlo (MCMC) method to estimate the model parameters. We illustrate the methodology with data from the open source Global Terrorism Database (GTD) that correspond to attacks in Afghanistan from 2013 to 2018. We show that our model is able to predict the intensity of future attacks for 2019-2021 while considering various covariates of interest such as population density, number of regional languages spoken, and the density of population supporting the opposing government.

[2]  arXiv:2110.08648 [pdf]
Title: Minding non-collapsibility of odds ratios when recalibrating risk prediction models
Comments: 10 Pages, 1 Figure, 1 Appendix
Subjects: Applications (stat.AP)

In clinical prediction modeling, model updating refers to the practice of modifying a prediction model before it is used in a new setting. In the context of logistic regression for a binary outcome, one of the simplest updating methods is a fixed odds-ratio transformation of predicted risks to improve calibration-in-the-large. Previous authors have proposed equations for calculating this odds ratio based on the discrepancy between the prevalence in the original and the new population, or between the average of predicted and observed risks. We show that this method fails to consider the non-collapsibility of the odds ratio. Consequently, it under-corrects predicted risks, especially when predicted risks are more dispersed (i.e., for models with good discrimination). We suggest an approximate equation for recovering the conditional odds ratio from the mean and variance of predicted risks. Brief simulations and a case study show that this approach reduces such under-correction. R code for implementation is provided.
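
The under-correction described in this abstract can be illustrated with a minimal sketch. The code assumes the fixed odds-ratio update has the form p' = OR * p / (1 - p + OR * p) and that the odds ratio is computed from the mean predicted risk and a target prevalence. The function names, the example risks, and the target value are hypothetical and are not taken from the paper or its R code.

```python
from statistics import mean


def marginal_odds_ratio(mean_predicted, target_prevalence):
    """Odds ratio between the target prevalence and the mean predicted risk."""
    target_odds = target_prevalence / (1.0 - target_prevalence)
    predicted_odds = mean_predicted / (1.0 - mean_predicted)
    return target_odds / predicted_odds


def apply_odds_ratio(risk, odds_ratio):
    """Multiply the odds of a single predicted risk by a fixed odds ratio."""
    return odds_ratio * risk / (1.0 - risk + odds_ratio * risk)


# Hypothetical, widely dispersed predicted risks (illustrative values only).
predicted = [0.02, 0.05, 0.10, 0.30, 0.60, 0.90]
target = 0.45  # hypothetical prevalence in the new setting

odds_ratio = marginal_odds_ratio(mean(predicted), target)
updated = [apply_odds_ratio(p, odds_ratio) for p in predicted]

# The odds transformation is nonlinear in the risk, so the updated mean lands
# between the original mean and the target rather than on the target.
print(f"original mean risk: {mean(predicted):.3f}")
print(f"updated mean risk:  {mean(updated):.3f}")
print(f"target prevalence:  {target:.3f}")
```

Applying the same odds ratio to a single risk equal to the mean would hit the target exactly; the gap appears only once the risks are spread out, which is consistent with the abstract's remark about models with good discrimination.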

[3]  arXiv:2110.08849 [pdf, other]
Title: A Bayesian Selection Model for Correcting Outcome Reporting Bias With Application to a Meta-analysis on Heart Failure Interventions
Comments: 26 pages, 5 tables, 8 figures
Subjects: Applications (stat.AP)

Multivariate meta-analysis (MMA) is a powerful tool for jointly estimating multiple outcomes' treatment effects. However, the validity of results from MMA is potentially compromised by outcome reporting bias (ORB), or the tendency for studies to selectively report outcomes. Until recently, ORB has been understudied. Since ORB can lead to biased conclusions, it is crucial to correct the estimates of effect sizes and quantify their uncertainty in the presence of ORB. With this goal, we develop a Bayesian selection model to adjust for ORB in MMA. We further propose a measure for quantifying the impact of ORB on the results from MMA. We evaluate our approaches through a meta-evaluation of 748 bivariate meta-analyses from the Cochrane Database of Systematic Reviews. Our model is motivated by and applied to a meta-analysis of interventions on hospital readmission and quality of life for heart failure patients. In our analysis, the relative risk (RR) of hospital readmission for the intervention group changes from a significant decrease (RR: 0.931, 95% confidence interval [CI]: 0.862-0.993) to a statistically nonsignificant effect (RR: 0.955, 95% CI: 0.876-1.051) after adjusting for ORB. This study demonstrates that failing to account for ORB can lead to different conclusions in a meta-analysis.

[4]  arXiv:2110.08882 [pdf, ps, other]
Title: Building Degradation Index with Variable Selection for Multivariate Sensory Data
Comments: 28 pages
Subjects: Applications (stat.AP)

The modeling and analysis of degradation data have been an active research area in reliability and system health management. As sensor technology advances, multivariate sensory data are commonly collected for the underlying degradation process. However, most existing research on degradation modeling requires a univariate degradation index to be provided. Thus, constructing a degradation index for multivariate sensory data is a fundamental step in degradation modeling. In this paper, we propose a novel degradation index building method for multivariate sensory data. Based on an additive nonlinear model with variable selection, the proposed method can automatically select the most informative sensor signals to be used in the degradation index. The penalized likelihood method with adaptive group penalty is developed for parameter estimation. We demonstrate that the proposed method outperforms existing methods via both simulation studies and analyses of the NASA jet engine sensor data.

[5]  arXiv:2110.08905 [pdf, other]
Title: Exploitation of error correlation in a large analysis validation: GlobCurrent case study
Comments: 24 pages, 14 figures
Journal-ref: Remote Sens. Environ., 217, 476-490 (2018)
Subjects: Applications (stat.AP); Statistics Theory (math.ST); Methodology (stat.ME)

An assessment of variance in ocean current signal and noise shared by in situ observations (drifters) and a large gridded analysis (GlobCurrent) is sought as a function of day of the year for 1993-2015 and across a broad spectrum of current speed. Regardless of the division of collocations, it is difficult to claim that any synoptic assessment can be based on independent observations. Instead, a measurement model that departs from ordinary linear regression by accommodating error correlation is proposed. The interpretation of independence is explored by applying Fuller's (1987) concept of equation and measurement error to a division of error into shared (correlated) and unshared (uncorrelated) components, respectively. The resulting division of variance in the new model favours noise. Ocean current shared (equation) error is of comparable magnitude to unshared (measurement) error and the latter is, for GlobCurrent and drifters respectively, comparable to ordinary and reverse linear regression. Although signal variance appears to be small, its utility as a measure of agreement between two variates is highlighted.
Sparse collocations that sample a dense grid permit a first order autoregressive form of measurement model to be considered, including parameterizations of analysis-in situ error cross-correlation and analysis temporal error autocorrelation. The former (cross-correlation) is an equation error term that accommodates error shared by both GlobCurrent and drifters. The latter (autocorrelation) facilitates an identification and retrieval of all model parameters. Solutions are sought using a prescribed calibration between GlobCurrent and drifters (by variance matching). Because the true current variance of GlobCurrent and drifters is small, signal to noise ratio is near zero at best. This is particularly evident for moderate current speed and meridional current component.

[6]  arXiv:2110.08967 [pdf, other]
Title: Assessing Ecosystem State Space Models: Identifiability and Estimation
Subjects: Applications (stat.AP); Quantitative Methods (q-bio.QM)

Bayesian methods are increasingly being applied to parameterize mechanistic process models used in environmental prediction and forecasting. In particular, models describing ecosystem dynamics with multiple states that are linear and autoregressive at each step in time can be treated as statistical state space models. In this paper we examine this subset of ecosystem models, giving closed form Gibbs sampling updates for latent states and process precision parameters when process and observation errors are normally distributed. We use simulated data from an example model (DALECev) to assess the performance of parameter estimation and identifiability under scenarios of gaps in observations. We show that process precision estimates become unreliable as temporal gaps between observed state data increase. To improve estimates, particularly precisions, we introduce a method of tuning the timestep of the latent states to leverage higher-frequency driver information. Further, we show that data cloning is a suitable method for assessing parameter identifiability in this class of models. Overall, our study helps inform the application of state space models to ecological forecasting applications where 1) data are not available for all states and transfers at the operational timestep for the ecosystem model and 2) process uncertainty estimation is desired.

[7]  arXiv:2110.08969 [pdf, ps, other]
Title: On completing a measurement model by symmetry
Comments: 4 pages
Subjects: Applications (stat.AP); Statistics Theory (math.ST); Methodology (stat.ME)

An appeal for symmetry is made to build established notions of specific representation and specific nonlinearity of measurement (often called model error) into a canonical linear regression model. Additive components are derived from the trivially complete model M = m. Factor analysis and equation error motivate corresponding notions of representation and nonlinearity in an errors-in-variables framework, with a novel interpretation of terms. It is suggested that a modern interpretation of correlation involves both linear and nonlinear association.

[8]  arXiv:2110.09013 [pdf, other]
Title: A Space-time Model for Inferring A Susceptibility Map for An Infectious Disease
Subjects: Applications (stat.AP)

Motivated by foot-and-mouth disease (FMD) outbreak data from Turkey, we develop a model to estimate disease risk based on a space-time record of outbreaks. The spread of infectious disease in geographical units depends on both transmission between neighbouring units and the intrinsic susceptibility of each unit to an outbreak. Spatially correlated susceptibility may arise from known factors, such as population density, or unknown (or unmeasured) factors such as commuter flows, environmental conditions, or health disparities. Our framework accounts for both space-time transmission and susceptibility. We model the unknown spatially correlated susceptibility as a Gaussian process. We show that the susceptibility surface can be estimated from observed, geo-located time series of infection events and use a projection-based dimension reduction approach which improves computational efficiency. In addition to identifying high risk regions from the Turkey FMD data, we also study how our approach works on the well known England-Wales measles outbreaks data; our latter study results in an estimated susceptibility surface that is strongly correlated with population size, consistent with prior analyses.

[9]  arXiv:2110.09497 [pdf, other]
Title: Gradient boosting with extreme-value theory for wildfire prediction
Authors: Jonathan Koh
Subjects: Applications (stat.AP)

This paper details the approach of the team $\textit{Kohrrelation}$ in the 2021 Extreme Value Analysis data challenge, dealing with the prediction of wildfire counts and sizes over the contiguous US. Our approach uses ideas from extreme-value theory in a machine learning context with theoretically justified loss functions for gradient boosting. We devise a spatial cross-validation scheme and show that in our setting it provides a better proxy for test set performance than naive cross-validation. The predictions are benchmarked against boosting approaches with different loss functions, and perform competitively in terms of the score criterion, finally placing second in the competition ranking.

Cross-lists for Tue, 19 Oct 21

[10]  arXiv:2110.08331 (cross-list from cs.LG) [pdf, other]
Title: A New Approach for Interpretability and Reliability in Clinical Risk Prediction: Acute Coronary Syndrome Scenario
Comments: Accepted for publication in the Artificial Intelligence in Medicine journal. Abstract abridged to respect the arXiv's characters limit
Journal-ref: Artificial Intelligence in Medicine, Volume 117, 2021
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)

We intend to create a new risk assessment methodology that combines the best characteristics of both risk score and machine learning models. More specifically, we aim to develop a method that, besides performing well, offers a personalized model and outcome for each patient, presents high interpretability, and incorporates an estimate of the prediction reliability, which is not usually available. By combining these features in the same approach, we expect that it can boost the confidence of physicians to use such a tool in their daily activity. To achieve these goals, a three-step methodology was developed: several rules were created by dichotomizing risk factors; such rules were trained with a machine learning classifier to predict the acceptance degree of each rule (the probability that the rule is correct) for each patient; that information was combined and used to compute the risk of mortality and the reliability of such prediction. The methodology was applied to a dataset of patients admitted with any type of acute coronary syndrome (ACS), to assess the 30-day all-cause mortality risk. The performance was compared with state-of-the-art approaches: logistic regression (LR), artificial neural network (ANN), and clinical risk score model (Global Registry of Acute Coronary Events - GRACE). The proposed approach achieved testing results identical to the standard LR, but offers superior interpretability and personalization; it also significantly outperforms the GRACE risk model and the standard ANN model. The calibration curve also suggests a very good generalization ability of the obtained model as it approaches the ideal curve. Finally, the reliability estimates of individual predictions correlated strongly with the misclassification rate. Those properties may have a beneficial application in other clinical scenarios as well. [abridged]

[11]  arXiv:2110.08411 (cross-list from stat.ME) [pdf, other]
Title: Multi-group Gaussian Processes
Subjects: Methodology (stat.ME); Applications (stat.AP)

Gaussian processes (GPs) are pervasive in functional data analysis, machine learning, and spatial statistics for modeling complex dependencies. Modern scientific data sets are typically heterogeneous and often contain multiple known discrete subgroups of samples. For example, in genomics applications samples may be grouped according to tissue type or drug exposure. In the modeling process it is desirable to leverage the similarity among groups while accounting for differences between them. While a substantial literature exists for GPs over Euclidean domains $\mathbb{R}^p$, GPs on domains suitable for multi-group data remain less explored. Here, we develop a multi-group Gaussian process (MGGP), which we define on $\mathbb{R}^p\times \mathscr{C}$, where $\mathscr{C}$ is a finite set representing the group label. We provide general methods to construct valid (positive definite) covariance functions on this domain, and we describe algorithms for inference, estimation, and prediction. We perform simulation experiments and apply MGGP to gene expression data to illustrate the behavior and advantages of the MGGP in the joint modeling of continuous and categorical variables.
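
One standard way to obtain a valid covariance function on a product domain such as $\mathbb{R}^p\times \mathscr{C}$ is to multiply a positive definite kernel on $\mathbb{R}^p$ by a positive semidefinite matrix indexed by the group labels. The sketch below uses that separable construction with a squared-exponential kernel. It is an illustrative assumption, not necessarily the construction developed in the paper, and the names `rbf`, `group_similarity`, and `multi_group_kernel` are hypothetical.

```python
import math


def rbf(x, y, length_scale=1.0):
    """Squared-exponential kernel on R^p for two equal-length sequences."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * length_scale ** 2))


# Hypothetical between-group similarity matrix for two groups (labels 0 and 1).
# Symmetric, unit diagonal, |off-diagonal| < 1, hence positive definite.
group_similarity = [
    [1.0, 0.6],
    [0.6, 1.0],
]


def multi_group_kernel(x, group_x, y, group_y, length_scale=1.0):
    """Separable covariance: spatial kernel times between-group similarity."""
    return rbf(x, y, length_scale) * group_similarity[group_x][group_y]


same_group = multi_group_kernel([0.0, 1.0], 0, [0.5, 1.0], 0)
cross_group = multi_group_kernel([0.0, 1.0], 0, [0.5, 1.0], 1)
print(same_group, cross_group)
```

In this sketch the off-diagonal entry of `group_similarity` controls how much information is shared between groups: a value near 1 treats the groups as nearly identical, while 0 makes them independent.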

[12]  arXiv:2110.08570 (cross-list from stat.ME) [pdf, other]
Title: A Reduced-Bias Weighted least square estimation of the Extreme Value Index
Comments: 24 pages
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)

In this paper, we propose a reduced-bias estimator of the extreme value index (EVI) for Pareto-type (heavy-tailed) distributions. This is derived using the weighted least squares method. It is shown that the estimator is unbiased, consistent and asymptotically normal under the second-order conditions on the underlying distribution of the data. The finite sample properties of the proposed estimator are studied through a simulation study. The results show that it is competitive with the existing estimators of the extreme value index in terms of bias and Mean Square Error. In addition, it yields estimates of $\gamma>0$ that are less sensitive to the number of top-order statistics, and hence, can be used for selecting an optimal tail fraction. The proposed estimator is further illustrated using practical datasets from the pedochemical and insurance domains.
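
For context on the sensitivity to the number of top-order statistics, the sketch below implements the classical Hill estimator of a positive extreme value index. This is a standard baseline and not the weighted least squares estimator proposed in the paper; the function name `hill_estimator` and the sample values are hypothetical.

```python
import math


def hill_estimator(sample, k):
    """Classical Hill estimate of the extreme value index from the k largest values."""
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    ordered = sorted(sample)
    threshold = ordered[-k - 1]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the threshold order statistic must be positive")
    top = ordered[-k:]
    # Average log-excess of the k largest values over the threshold.
    return sum(math.log(value / threshold) for value in top) / k


# Hypothetical positive sample; the estimate generally changes with k.
sample = [1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 7.5, 12.0, 30.0]
for k in (2, 4, 6):
    print(k, round(hill_estimator(sample, k), 3))
```

Plotting such estimates against k is the usual way to judge how strongly an estimator depends on the chosen tail fraction.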

[13]  arXiv:2110.08605 (cross-list from cs.DL) [pdf, other]
Title: Statistics in everyone's backyard: an impact study via citation network analysis
Subjects: Digital Libraries (cs.DL); Applications (stat.AP)

The increasing availability of curated citation data provides a wealth of resources for analyzing and understanding the intellectual influence of scientific publications. In the field of statistics, current studies of citation data have mostly focused on the interactions between statistical journals and papers, limiting the measure of influence to mainly within statistics itself. In this paper, we take the first step towards understanding the impact statistics has made on other scientific fields in the era of Big Data. By collecting comprehensive bibliometric data from the Web of Science database for selected statistical journals, we investigate the citation trends and compositions of citing fields over time to show that their diversity has been increasing. Furthermore, we use the local clustering technique involving personalized PageRank with conductance for size selection to find the most relevant statistical research area for a given external topic of interest. We provide theoretical guarantees for the procedure and, through a number of case studies, show that the results from our citation data align well with our knowledge and intuition about these external topics. Overall, we have found that the statistical theory and methods recently invented by the statistics community have had an increasing impact on other scientific fields.

[14]  arXiv:2110.08970 (cross-list from stat.ME) [pdf, other]
Title: Sample size calculations for n-of-1 trials
Subjects: Methodology (stat.ME); Applications (stat.AP)

N-of-1 trials, single participant trials in which multiple treatments are sequentially randomized over the study period, can give direct estimates of individual-specific treatment effects. Combining n-of-1 trials gives extra information for estimating the population average treatment effect compared with randomized controlled trials and increases precision for individual-specific treatment effect estimates. In this paper, we present a procedure for designing n-of-1 trials. We formally define the design components for determining the sample size of a series of n-of-1 trials, present models for analyzing these trials and use them to derive the sample size formula for estimating the population average treatment effect and the standard error of the individual-specific treatment effect estimates. We recommend first finding the possible designs that will satisfy the power requirement for estimating the population average treatment effect and then, if of interest, finalizing the design to also satisfy the standard error requirements for the individual-specific treatment effect estimates. The procedure is implemented and illustrated in the paper and through a Shiny app.

[15]  arXiv:2110.09154 (cross-list from cs.SI) [pdf, other]
Title: Measuring the influence of beliefs in belief networks
Comments: 19 pages, 4 figures. Earlier version of this work was presented at Networks 2021 conference
Subjects: Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph); Applications (stat.AP)

Influential beliefs are crucial for our understanding of how people reason about political issues and make political decisions. This research proposes a new method for measuring the influence of political beliefs within the larger context of belief system networks, based on advances in psychometric network methods and network influence research. Using the latest round of the European Social Survey data, we demonstrate this approach on a belief network expressing support for the regime in 29 European countries and capturing beliefs related to support for regime performance, principles, institutions, and political actors. Our results show that the average influence of beliefs can be related to the consistency and connectivity of the belief network and that the influence of specific beliefs (e.g. Satisfaction with Democracy) on a country level has a significant negative correlation with external indicators from the same domain (e.g. Liberal Democracy index), which suggests that highly influential beliefs are related to pressing political issues. These findings suggest that network-based belief influence metrics estimated from large-scale survey data can be used as a new type of indicator in comparative political research, which opens new avenues for integrating psychometric network analysis methods into political science methodology.

[16]  arXiv:2110.09234 (cross-list from cs.CY) [pdf, other]
Title: Impact of COVID-19 Policies and Misinformation on Social Unrest
Authors: Martha Barnard (1), Radhika Iyer (1 and 2), Sara Y. Del Valle (1), Ashlynn R. Daughton (1) ((1) A-1 Information Systems and Modeling, Los Alamos National Lab, Los Alamos, NM, USA, (2) Department of Political Science and Department of Computing, Data Science, and Society, University of California, Berkeley, Berkeley, CA, USA)
Comments: 21 pages, 9 figures
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP)

The novel coronavirus disease (COVID-19) pandemic has impacted every corner of the earth, disrupting governments and leading to socioeconomic instability. This crisis has prompted questions surrounding how different sectors of society interact and influence each other during times of change and stress. Given the unprecedented economic and societal impacts of this pandemic, many new data sources have become available, allowing us to quantitatively explore these associations. Understanding these relationships can help us better prepare for future disasters and mitigate the impacts. Here, we focus on the interplay between social unrest (protests), health outcomes, public health orders, and misinformation in eight countries of Western Europe and four regions of the United States. We created 1-3 week forecasts of both a binary protest metric for identifying times of high protest activity and the overall protest counts over time. We found that for all regions, except Belgium, at least one feature from our various data streams was predictive of protests. However, the accuracy of the protest forecasts varied by country: for roughly half of the countries analyzed, our forecasts outperform a naive model. These mixed results demonstrate the potential of diverse data streams to predict a topic as volatile as protests as well as the difficulties of predicting a situation that is as rapidly evolving as a pandemic.

[17]  arXiv:2110.09272 (cross-list from cs.CY) [pdf]
Title: Multi-Objective Allocation of COVID-19 Testing Centers: Improving Coverage and Equity in Access
Subjects: Computers and Society (cs.CY); Optimization and Control (math.OC); Applications (stat.AP)

At the time of this article, COVID-19 has been transmitted to more than 42 million people and resulted in more than 673,000 deaths across the United States. Throughout this pandemic, public health authorities have monitored the results of diagnostic testing to identify hotspots of transmission. Such information can help reduce or block transmission paths of COVID-19 and help infected patients receive early treatment. However, most current schemes of test site allocation have been based on experience or convenience, often resulting in low efficiency and non-optimal allocation. In addition, the historical sociodemographic patterns of populations within cities can result in measurable inequities in access to testing between various racial and income groups. To address these pressing issues, we propose a novel test site allocation scheme to (a) maximize population coverage, (b) minimize prediction uncertainties associated with projections of outbreak trajectories, and (c) reduce inequities in access. We illustrate our approach with case studies comparing our allocation scheme with the recorded allocation of testing sites in Georgia, revealing both increases in population coverage and improvements in equity of access over current practice.

[18]  arXiv:2110.09429 (cross-list from q-fin.TR) [pdf, other]
Title: Understanding jumps in high frequency digital asset markets
Subjects: Trading and Market Microstructure (q-fin.TR); Applications (stat.AP)

While attention is a predictor for digital asset prices, and jumps in Bitcoin prices are well-known, we know little about its alternatives. Studying high frequency crypto data gives us the unique possibility to confirm that cross market digital asset returns are driven by high frequency jumps clustered around black swan events, resembling volatility and trading volume seasonalities. Regressions show that intra-day jumps significantly influence end of day returns in size and direction. This provides fundamental research for crypto option pricing models. However, we need better econometric methods for capturing the specific market microstructure of cryptos. All calculations are reproducible via the quantlet.com technology.

Replacements for Tue, 19 Oct 21

[19]  arXiv:2106.10624 (replaced) [pdf]
Title: Combined tests based on restricted mean time lost for competing risks data
Comments: 26 pages, 3 figures
Journal-ref: Statistics in Biopharmaceutical Research, 2021
Subjects: Applications (stat.AP); Methodology (stat.ME)

[20]  arXiv:1911.09171 (replaced) [pdf, other]
Title: Re-Evaluating Strengthened-IV Designs: Asymptotic Efficiency, Bias Formula, and the Validity and Power of Sensitivity Analyses
Comments: 86 pages, 4 figures, 6 tables
Subjects: Methodology (stat.ME); Applications (stat.AP)

[21]  arXiv:2007.14052 (replaced) [pdf, other]
Title: Multioutput Gaussian Processes with Functional Data: A Study on Coastal Flood Hazard Assessment
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)

[22]  arXiv:2103.03632 (replaced) [pdf, other]
Title: Modeling tail risks of inflation using unobserved component quantile regressions
Comments: JEL: C11, C22, C53, E31; Keywords: state space models, time-varying parameters, stochastic volatility, predictive inference
Subjects: Econometrics (econ.EM); Applications (stat.AP)

[23]  arXiv:2104.14204 (replaced) [pdf, other]
Title: Optimal bidding in hourly and quarter-hourly electricity price auctions: trading large volumes of power with market impact and transaction costs
Subjects: Statistical Finance (q-fin.ST); Mathematical Finance (q-fin.MF); Portfolio Management (q-fin.PM); Trading and Market Microstructure (q-fin.TR); Applications (stat.AP)

[24]  arXiv:2110.05430 (replaced) [pdf, other]
Title: Density-based interpretable hypercube region partitioning for mixed numeric and categorical data
Subjects: Machine Learning (cs.LG); Applications (stat.AP)