New submissions

[ total of 22 entries: 1-22 ]

New submissions for Fri, 3 Dec 21

[1]  arXiv:2112.00807 [pdf, other]
Title: Intervention treatment distributions that depend on the observed treatment process and model double robustness in causal survival analysis
Comments: 19 pages, 1 figure
Subjects: Methodology (stat.ME); Applications (stat.AP)

The generalized g-formula can be used to estimate the probability of survival under a sustained treatment strategy. When treatment strategies are deterministic, estimators derived from the so-called efficient influence function (EIF) for the g-formula will be doubly robust to model misspecification. In recent years, several practical applications have motivated estimation of the g-formula under non-deterministic treatment strategies where treatment assignment at each time point depends on the observed treatment process. In this case, EIF-based estimators may or may not be doubly robust. In this paper, we provide sufficient conditions to ensure existence of doubly robust estimators for intervention treatment distributions that depend on the observed treatment process for point treatment interventions, and give a class of intervention treatment distributions dependent on the observed treatment process that guarantee model doubly and multiply robust estimators in longitudinal settings. Motivated by an application to pre-exposure prophylaxis (PrEP) initiation studies, we propose a new treatment intervention dependent on the observed treatment process. We show there exist 1) estimators that are doubly and multiply robust to model misspecification, and 2) estimators that when used with machine learning algorithms can attain fast convergence rates for our proposed intervention. Theoretical results are confirmed via simulation studies.

[2]  arXiv:2112.00816 [pdf, other]
Title: Maximum Likelihood Estimation for Brownian Motion Tree Models Based on One Sample
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)

We study the problem of maximum likelihood estimation given one data sample ($n=1$) over Brownian Motion Tree Models (BMTMs), a class of Gaussian models on trees. BMTMs are often used as a null model in phylogenetics, where the one-sample regime is common. Specifically, we show that, almost surely, the one-sample BMTM maximum likelihood estimator (MLE) exists, is unique, and corresponds to a fully observed tree. Moreover, we provide a polynomial time algorithm for its exact computation. We also consider the MLE over all possible BMTM tree structures in the one-sample case and show that it exists almost surely, that it coincides with the MLE over diagonally dominant M-matrices, and that it admits a unique closed-form solution that corresponds to a path graph. Finally, we explore statistical properties of the one-sample BMTM MLE through numerical experiments.

[3]  arXiv:2112.00832 [pdf, ps, other]
Title: On the robustness and precision of mixed-model analysis of covariance in cluster-randomized trials
Subjects: Methodology (stat.ME)

In the analyses of cluster-randomized trials, a standard approach for covariate adjustment and handling within-cluster correlations is the mixed-model analysis of covariance (ANCOVA). The mixed-model ANCOVA makes stringent assumptions, including normality, linearity, and a compound symmetric correlation structure, which may be challenging to verify and may not hold in practice. When mixed-model ANCOVA assumptions are violated, the validity and efficiency of the model-based inference for the average treatment effect are currently unclear. In this article, we prove that the mixed-model ANCOVA estimator for the average treatment effect is consistent and asymptotically normal under arbitrary misspecification of its working model. Under equal randomization, we further show that the model-based variance estimator for the mixed-model ANCOVA estimator remains consistent, clarifying that the confidence interval given by standard software is asymptotically valid even under model misspecification. Beyond robustness, we also provide a caveat that covariate adjustment via mixed-model ANCOVA may lead to precision loss compared to no adjustment when the covariance structure is misspecified, and describe when a cluster-level ANCOVA becomes more efficient. These results hold under both simple and stratified randomization, and are further illustrated via simulations as well as analyses of three cluster-randomized trials.

[4]  arXiv:2112.00855 [pdf, ps, other]
Title: Investigating an Alternative for Estimation from a Nonprobability Sample: Matching plus Calibration
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)

Matching a nonprobability sample to a probability sample is one strategy both for selecting the nonprobability units and for weighting them. This approach has been employed in the past to select subsamples of persons from a large panel of volunteers. One method of weighting, introduced here, is to assign a unit in the nonprobability sample the weight from its matched case in the probability sample. The properties of resulting estimators depend on whether the probability sample weights are inverses of selection probabilities or are calibrated. In addition, imperfect matching can cause estimates from the matched sample to be biased so that its weights need to be adjusted, especially when the size of the volunteer panel is small. Calibration weighting combined with matching is one approach to correcting bias and reducing variances. We explore the theoretical properties of the matched and matched, calibrated estimators with respect to a quasirandomization distribution that is assumed to describe how units in the nonprobability sample are observed, a superpopulation model for analysis variables collected in the nonprobability sample, and the randomization distribution for the probability sample. Numerical studies using simulated and real data from the 2015 US Behavioral Risk Factor Surveillance Survey are conducted to examine the performance of the alternative estimators.

[5]  arXiv:2112.00871 [pdf, ps, other]
Title: Diffusion Mean Estimation on the Diagonal of Product Manifolds
Subjects: Methodology (stat.ME)

Computing sample means on Riemannian manifolds is typically computationally costly. The Fr\'echet mean offers a generalization of the Euclidean mean to general metric spaces, particularly to Riemannian manifolds. Evaluating the Fr\'echet mean numerically on Riemannian manifolds requires the computation of geodesics for each sample point. When closed-form expressions do not exist for geodesics, an optimization-based approach is employed. In geometric deep-learning, particularly Riemannian convolutional neural networks, a weighted Fr\'echet mean enters each layer of the network, potentially requiring an optimization in each layer. The weighted diffusion-mean offers an alternative weighted mean sample estimator on Riemannian manifolds that does not require the computation of geodesics. Instead, we present a simulation scheme to sample guided diffusion bridges on a product manifold conditioned to intersect at a predetermined time. Such a conditioning is non-trivial since, in general, manifolds cannot be covered by a single chart. Exploiting the exponential chart, the conditioning can be made similar to that in the Euclidean setting.

[6]  arXiv:2112.01164 [pdf, other]
Title: Sequential Spatially Balanced Sampling
Subjects: Methodology (stat.ME)

Sequential sampling occurs when the entire population is not known in advance and data are obtained one at a time or in groups of units. This manuscript proposes a new algorithm to sequentially select a balanced sample. The algorithm respects equal and unequal inclusion probabilities. The method can also be used to select a spatially balanced sample if the population of interest contains spatial coordinates. A simulation study is proposed on a dataset of Swiss municipalities. The results show that the proposed method outperforms other methods.

[7]  arXiv:2112.01369 [pdf, other]
Title: The Classic Cross-Correlation and the Real-Valued Jaccard and Coincidence Indices
Comments: 9 pages, 8 figures. A preprint
Subjects: Methodology (stat.ME); Information Theory (cs.IT)

In this work we describe and compare the classic inner product and Pearson correlation coefficient as well as the recently introduced real-valued Jaccard and coincidence indices. Special attention is given to diverse schemes for taking into account the signs of the operands, as well as to the study of the geometry of the scalar field surface related to the generalized multiset binary operations underlying the considered similarity indices. The possibility to split the classic inner product, cross-correlation, and Pearson correlation coefficient is also described.

[8]  arXiv:2112.01372 [pdf, other]
Title: Hierarchical clustering: visualization, feature importance and model selection
Comments: 18 pages, 7 figures
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)

We propose methods for the analysis of hierarchical clustering that fully use the multi-resolution structure provided by a dendrogram. Specifically, we propose a loss for choosing between clustering methods, a feature importance score and a graphical tool for visualizing the segmentation of features in a dendrogram. Current approaches to these tasks lead to loss of information since they require the user to generate a single partition of the instances by cutting the dendrogram at a specified level. Our proposed methods, instead, use the full structure of the dendrogram. The key insight behind the proposed methods is to view a dendrogram as a phylogeny. This analogy permits the assignment of a feature value to each internal node of a tree through ancestral state reconstruction. Real and simulated datasets provide evidence that our proposed framework has desirable outcomes. We provide an R package that implements our methods.

[9]  arXiv:2112.01374 [pdf]
Title: On the optimization of hyperparameters in Gaussian process regression
Comments: 14 pages, 2 figures, 2 tables
Subjects: Methodology (stat.ME); Numerical Analysis (math.NA)

When the data are sparse, optimization of hyperparameters of the kernel in Gaussian process regression by the commonly used maximum likelihood estimation (MLE) criterion often leads to overfitting. We show that choosing hyperparameters based on a criterion of the completeness of the basis in the corresponding linear regression problem is superior to MLE. We show that this is facilitated by the use of High-dimensional model representation whereby a low-order HDMR representation can provide reliable reference functions and large synthetic test data sets needed for basis parameter optimization even with few data.

[10]  arXiv:2112.01380 [pdf, other]
Title: Prior knowledge elicitation: The past, present, and future
Comments: 60 pages, 1 figure
Subjects: Methodology (stat.ME)

Specification of the prior distribution for a Bayesian model is a central part of the Bayesian workflow for data analysis, but it is often difficult even for statistical experts. Prior elicitation transforms domain knowledge of various kinds into well-defined prior distributions, and offers a solution to the prior specification problem, in principle. In practice, however, we are still fairly far from having usable prior elicitation tools that could significantly influence the way we build probabilistic models in academia and industry. We lack elicitation methods that integrate well into the Bayesian workflow and perform elicitation efficiently in terms of costs of time and effort. We even lack a comprehensive theoretical framework for understanding different facets of the prior elicitation problem.
Why are we not widely using prior elicitation? We analyze the state of the art by identifying a range of key aspects of prior knowledge elicitation, from properties of the modelling task and the nature of the priors to the form of interaction with the expert. The existing prior elicitation literature is reviewed and categorized in these terms. This allows recognizing under-studied directions in prior elicitation research, finally leading to a proposal of several new avenues to improve prior elicitation methodology.

Cross-lists for Fri, 3 Dec 21

[11]  arXiv:2112.00827 (cross-list from cs.CL) [pdf, other]
Title: Changepoint Analysis of Topic Proportions in Temporal Text Data
Comments: 32 pages, 9 figures
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

Changepoint analysis deals with unsupervised detection and/or estimation of time-points in time-series data, when the distribution generating the data changes. In this article, we consider \emph{offline} changepoint detection in the context of large scale textual data. We build a specialised temporal topic model with provisions for changepoints in the distribution of topic proportions. As full likelihood based inference in this model is computationally intractable, we develop a computationally tractable approximate inference procedure. More specifically, we use sample splitting to estimate topic polytopes first and then apply a likelihood ratio statistic together with a modified version of the wild binary segmentation algorithm of Fryzlewicz et al. (2014). Our methodology facilitates automated detection of structural changes in large corpora without the need for manual processing by domain experts. As changepoints under our model correspond to changes in topic structure, the estimated changepoints are often highly interpretable as marking the surge or decline in popularity of a fashionable topic. We apply our procedure to two large datasets: (i) a corpus of English literature from the period 1800-1922 (Underwood et al., 2015); (ii) abstracts from the High Energy Physics arXiv repository (Clement et al., 2019). We obtain some historically well-known changepoints and discover some new ones.

[12]  arXiv:2112.00866 (cross-list from stat.CO) [pdf, other]
Title: Bridge Simulation on Lie Groups and Homogeneous Spaces with Application to Parameter Estimation
Comments: arXiv admin note: text overlap with arXiv:2106.03431
Subjects: Computation (stat.CO); Probability (math.PR); Methodology (stat.ME)

We present three simulation schemes for simulating Brownian bridges on complete and connected Lie groups and homogeneous spaces and use numerical results of the guided processes in the Lie group $\mathrm{SO}(3)$ and on the homogeneous spaces $\mathrm{SPD}(3) = \mathrm{GL}_+(3)/\mathrm{SO}(3)$ and $\mathbb S^2 = \mathrm{SO}(3)/\mathrm{SO}(2)$ to evaluate our sampling scheme. Brownian motions on Lie groups can be defined via the Laplace-Beltrami operator of a left- (or right-)invariant Riemannian metric. Given i.i.d. Lie group-valued samples on $\mathrm{SO}(3)$ drawn from a Brownian motion with unknown Riemannian metric structure, the underlying Riemannian metric on $\mathrm{SO}(3)$ is estimated using an iterative maximum likelihood (MLE) method. Furthermore, the re-sampling technique is applied to yield estimates of the heat kernel on the two-sphere considered as a homogeneous space. Comparing this estimate to the truncated version of the closed-form expression for the heat kernel on $\mathbb S^2$ serves as a proof of concept for the validity of the sampling scheme on homogeneous spaces.

[13]  arXiv:2112.01063 (cross-list from cs.CV) [pdf, other]
Title: Fast automatic deforestation detectors and their extensions for other spatial objects
Subjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Methodology (stat.ME)

This paper is devoted to the problem of detection of forest and non-forest areas on Earth images. We propose two statistical methods to tackle this problem: one based on multiple hypothesis testing with parametric distribution families, another one -- on non-parametric tests. The parametric approach is novel in the literature and relevant to a larger class of problems -- detection of natural objects, as well as anomaly detection. We develop mathematical background for each of the two methods, build self-sufficient detection algorithms using them and discuss numerical aspects of their implementation. We also compare our algorithms with those from standard machine learning using satellite data.

Replacements for Fri, 3 Dec 21

[14]  arXiv:2006.00077 (replaced) [pdf, other]
Title: CLARITY -- Comparing heterogeneous data using dissimiLARITY
Comments: R package available from this https URL. 30 pages, 8 figures
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
[15]  arXiv:2006.05371 (replaced) [pdf, other]
Title: Bayesian Probabilistic Numerical Integration with Tree-Based Models
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
[16]  arXiv:2102.08931 (replaced) [pdf]
Title: Overcoming bias in representational similarity analysis
Authors: Roberto Viviani
Comments: 17 pages, 6 figures
Subjects: Methodology (stat.ME); Quantitative Methods (q-bio.QM)
[17]  arXiv:2107.04873 (replaced) [pdf, ps, other]
Title: The EAS approach to variable selection for multivariate response data in high-dimensional settings
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
[18]  arXiv:2109.05121 (replaced) [pdf, ps, other]
Title: Diagnostics for Monte Carlo Algorithms for Models with Intractable Normalizing Functions
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO)
[19]  arXiv:2111.14966 (replaced) [pdf, other]
Title: Confidence regions for univariate and multivariate data using permutation tests
Comments: Updated with author affiliation and keywords
Subjects: Methodology (stat.ME)
[20]  arXiv:2010.01396 (replaced) [pdf, other]
Title: Regularized Bayesian calibration and scoring of the WD-FAB IRT model improves predictive performance over marginal maximum likelihood
Comments: Revision in review PLOS one
Subjects: Applications (stat.AP); Methodology (stat.ME)
[21]  arXiv:2108.00866 (replaced) [pdf, other]
Title: Nonparametric posterior learning for emission tomography with multimodal data
Authors: Fedor Goncharov (LIST), Éric Barat (LIST), Thomas Dautremer (LIST)
Subjects: Machine Learning (stat.ML); Mathematical Physics (math-ph); Applications (stat.AP); Methodology (stat.ME)
[22]  arXiv:2111.10628 (replaced) [pdf]
Title: Localized Mutual Information Monitoring of Pairwise Associations in Animal Movement
Subjects: Quantitative Methods (q-bio.QM); Populations and Evolution (q-bio.PE); Methodology (stat.ME)
