New submissions

[ total of 11 entries: 1-11 ]

New submissions for Thu, 25 Nov 21

[1]  arXiv:2111.12149 [pdf, other]
Title: Binned multinomial logistic regression for integrative cell type annotation
Subjects: Applications (stat.AP)

Categorizing individual cells into one of many known cell type categories, also known as cell type annotation, is a critical step in the analysis of single-cell genomics data. The current process of annotation is time-intensive and subjective, which has led to different studies describing cell types with labels of varying degrees of resolution. While supervised learning approaches have provided automated solutions to annotation, there remains a significant challenge in fitting a unified model for multiple datasets with inconsistent labels. In this article, we propose a new multinomial logistic regression estimator which can be used to model cell type probabilities by integrating multiple datasets with labels of varying resolution. To compute our estimator, we solve a nonconvex optimization problem using a blockwise proximal gradient descent algorithm. We show through simulation studies that our approach estimates cell type probabilities more accurately than competitors in a wide variety of scenarios. We apply our method to ten single-cell RNA-seq datasets and demonstrate its utility in predicting fine resolution cell type labels on unlabeled data as well as refining cell type labels on data with existing coarse resolution annotations. An R package implementing the method is available at https://github.com/keshav-motwani/IBMR and the collection of datasets we analyze is available at https://github.com/keshav-motwani/AnnotatedPBMC.

[2]  arXiv:2111.12163 [pdf, other]
Title: spOccupancy: An R package for single species, multispecies, and integrated spatial occupancy models
Comments: 31 pages, 4 figures
Subjects: Applications (stat.AP)

Occupancy modeling is a common approach to assess spatial and temporal species distribution patterns, while explicitly accounting for measurement errors common in detection-nondetection data. Numerous extensions of the basic single species occupancy model exist to address dynamics, multiple species or states, interactions, false positive errors, autocorrelation, and to integrate multiple data sources. However, specialized and computationally efficient software to fit spatial models to large data sets is scarce or absent. We introduce the spOccupancy R package designed to fit single species, multispecies, and integrated spatially-explicit occupancy models. Using a Bayesian framework, we leverage Pólya-Gamma data augmentation and Nearest Neighbor Gaussian Processes to ensure models are computationally efficient for potentially massive data sets. spOccupancy provides user-friendly functions for data simulation, model fitting, model validation (by posterior predictive checks), model comparison (using information criteria and k-fold cross-validation), and out-of-sample prediction. We illustrate the package's functionality via a vignette, simulated data analysis, and two bird case studies, in which we estimate occurrence of the Black-throated Green Warbler (Setophaga virens) across the eastern USA and species richness of a foliage-gleaning bird community in the Hubbard Brook Experimental Forest in New Hampshire, USA. The spOccupancy package provides a user-friendly approach to fit a variety of single and multispecies occupancy models, making it straightforward to address detection biases and spatial autocorrelation in species distribution models even for large data sets.

[3]  arXiv:2111.12272 [pdf, other]
Title: Causal Analysis and Prediction of Human Mobility in the U.S. during the COVID-19 Pandemic
Subjects: Applications (stat.AP); Machine Learning (cs.LG)

Since the spread of COVID-19 in the U.S., which had the highest number of confirmed cases and deaths in the world as of September 2020, most states in the country have enforced travel restrictions, resulting in sharp reductions in mobility. However, the overall impact and long-term implications of this crisis on travel and mobility remain uncertain. To this end, this study develops an analytical framework that determines and analyzes the most dominant factors impacting human mobility and travel in the U.S. during this pandemic. In particular, the study uses Granger causality to determine the important predictors influencing daily vehicle miles traveled (VMT) and utilizes linear regularization algorithms, including Ridge and LASSO techniques, to model and predict mobility. State-level time-series data were obtained from various open-access sources for the period from March 1, 2020 through June 13, 2020, and the entire data set was divided into two parts for training and testing purposes. The variables selected by Granger causality were used to train three different reduced-order models by ordinary least squares regression, Ridge regression, and LASSO regression. Finally, the prediction accuracy of the developed models was examined on the test data. The results indicate that factors including the number of new COVID cases, social distancing index, population staying at home, percent of out-of-county trips, trips to different destinations, socioeconomic status, percent of people working from home, and statewide closure, among others, were the most important factors influencing daily VMT. Also, among all the modeling techniques, Ridge regression provided the best performance with the least error, while LASSO regression also performed better than the ordinary least squares model.

[4]  arXiv:2111.12283 [pdf, other]
Title: Coexchangeable process modelling for uncertainty quantification in joint climate reconstruction
Comments: Submitted to the Journal of the American Statistical Association
Subjects: Applications (stat.AP)

Any experiment with climate models relies on a potentially large set of spatio-temporal boundary conditions. These can represent the initial state of the system, the forcings driving the model output throughout the experiment, or both. Whilst these boundary conditions are typically fixed using available reconstructions in climate modelling studies, they are highly uncertain, that uncertainty is unquantified, and the effect on the output of the experiment can be considerable. We develop efficient quantification of these uncertainties that combines relevant data from multiple models and observations. Starting from the coexchangeability model, we develop a coexchangeable process model to capture multiple correlated spatio-temporal fields of variables. We demonstrate that further exchangeability judgements over the parameters within this representation lead to a Bayes linear analogy of a hierarchical model. We use the framework to provide a joint reconstruction of sea-surface temperature and sea-ice concentration boundary conditions at the last glacial maximum (19-23 ka) and use it to force an ensemble of ice-sheet simulations using the FAMOUS-Ice coupled atmosphere and ice-sheet model. We demonstrate that existing boundary conditions typically used in these experiments are implausible given our uncertainties and demonstrate the impact of using more plausible boundary conditions on ice-sheet simulation.

[5]  arXiv:2111.12348 [pdf]
Title: Comparative Evaluation of Statistical Orbit Determination Algorithms for Short-Term Prediction of Geostationary and Geosynchronous Satellite Orbits in NavIC Constellation
Subjects: Applications (stat.AP)

NavIC is a newly established Indian regional navigation constellation with 3 satellites in geostationary Earth orbit (GEO) and 4 satellites in geosynchronous orbit (GSO). Satellite positions are essential in navigation for various positioning applications. In this paper, we propose a Bootstrap Particle Filter (BPF) approach to determine the satellite positions in the NavIC constellation over a short duration of 1 hr. The BPF-based approach was found to be efficient, with meter-level prediction accuracy, as compared to other methods such as Least Squares (LS), Extended Kalman Filter (EKF), Unscented Kalman Filter (UKF) and Ensemble Kalman Filter (EnKF). The residual analysis revealed that the BPF approach addressed the problem of non-linearity in the dynamics model as well as the non-Gaussian nature of the state of the NavIC satellites.

[6]  arXiv:2111.12526 [pdf]
Title: Mining Meta-indicators of University Ranking: A Machine Learning Approach Based on SHAP
Authors: Shudong Yang (1), Miaomiao Liu (1) ((1) Dalian University of Technology)
Comments: 4 pages, 1 figure
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)

University evaluation and ranking is an extremely complex activity. Major universities are struggling because of the increasingly complex indicator systems of world university rankings. So can we find the meta-indicators of the index system by simplifying the complexity? This research discovered three meta-indicators based on interpretable machine learning. The first is time: to be friends with time, believe in the power of time, and accumulate historical deposits. The second is space: to be friends with the city and grow together through co-development. The third is relationships: to be friends with alumni and strive for more alumni donations without a ceiling.

Cross-lists for Thu, 25 Nov 21

[7]  arXiv:2111.12201 (cross-list from stat.ME) [pdf, other]
Title: Parameter estimation and uncertainty quantification using information geometry
Comments: 50 pages (exc. references), 12 figures. Review
Subjects: Methodology (stat.ME); Applications (stat.AP)

In this work we (1) review likelihood-based inference for parameter estimation and the construction of confidence regions, and (2) explore the use of techniques from information geometry, including geodesic curves and Riemann scalar curvature, to supplement typical techniques for uncertainty quantification such as Bayesian methods, profile likelihood, asymptotic analysis and bootstrapping. These techniques from information geometry provide data-independent insights into uncertainty and identifiability, and can be used to inform data collection decisions. All code used in this work to implement the inference and information geometry techniques is available on GitHub.

[8]  arXiv:2111.12267 (cross-list from stat.OT) [pdf, other]
Title: The Practical Scope of the Central Limit Theorem
Comments: 47 pages, 17 figures
Subjects: Other Statistics (stat.OT); Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)

The Central Limit Theorem (CLT) is at the heart of a great deal of applied problem-solving in statistics and data science, but the theorem is silent on an important implementation issue: how much data do you need for the CLT to give accurate answers to practical questions? Here we examine several approaches to addressing this issue, along the way reviewing the history of this problem over the last 290 years, and we illustrate the calculations with case studies from finite-population sampling and gambling. A variety of surprises emerge.

[9]  arXiv:2111.12486 (cross-list from physics.ao-ph) [pdf, other]
Title: Enhanced monitoring of atmospheric methane from space with hierarchical Bayesian inference
Comments: 20 pages, 6 figures. Under consideration at Nature Communications
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Geophysics (physics.geo-ph); Applications (stat.AP)

Methane is a strong greenhouse gas, with a higher radiative forcing per unit mass and shorter atmospheric lifetime than carbon dioxide. The remote sensing of methane in regions of industrial activity is a key step toward the accurate monitoring of emissions that drive climate change. Whilst the TROPOspheric Monitoring Instrument (TROPOMI) on board the Sentinel-5P satellite is capable of providing daily global measurement of methane columns, data are often compromised by cloud cover. Here, we develop a statistical model which uses nitrogen dioxide concentration data from TROPOMI to accurately predict values of methane columns, expanding the average daily spatial coverage of observations of the Permian Basin from 16% to 88% in the year 2019. The addition of predicted methane abundances at locations where direct observations are not available will support inversion methods for estimating methane emission rates at shorter timescales than is currently possible.

[10]  arXiv:2111.12612 (cross-list from math.ST) [pdf, other]
Title: Multiplier bootstrap for Bures-Wasserstein barycenters
Comments: 36 pages, 2 figures
Subjects: Statistics Theory (math.ST); Applications (stat.AP)

The Bures-Wasserstein barycenter is a popular and promising tool in the analysis of complex data such as graphs and images. In many applications the input data are random with an unknown distribution, and uncertainty quantification becomes a crucial issue. This paper offers an approach based on the multiplier bootstrap to quantify the error of approximating the true Bures-Wasserstein barycenter $Q_*$ by its empirical counterpart $Q_n$. The main results state the bootstrap validity under general assumptions on the data generating distribution $P$ and specify the approximation rates for the case of sub-exponential $P$. The performance of the method is illustrated on synthetic data generated from the weighted stochastic block model.

Replacements for Thu, 25 Nov 21

[11]  arXiv:2109.02624 (replaced) [pdf, other]
Title: Functional additive models on manifolds of planar shapes and forms
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)