Methodology
New submissions
New submissions for Fri, 3 Dec 21
 [1] arXiv:2112.00807 [pdf, other]

Title: Intervention treatment distributions that depend on the observed treatment process and model double robustness in causal survival analysis
Comments: 19 pages, 1 figure
Subjects: Methodology (stat.ME); Applications (stat.AP)
The generalized g-formula can be used to estimate the probability of survival under a sustained treatment strategy. When treatment strategies are deterministic, estimators derived from the so-called efficient influence function (EIF) for the g-formula will be doubly robust to model misspecification. In recent years, several practical applications have motivated estimation of the g-formula under non-deterministic treatment strategies where treatment assignment at each time point depends on the observed treatment process. In this case, EIF-based estimators may or may not be doubly robust. In this paper, we provide sufficient conditions to ensure existence of doubly robust estimators for intervention treatment distributions that depend on the observed treatment process for point treatment interventions, and give a class of intervention treatment distributions dependent on the observed treatment process that guarantee model doubly and multiply robust estimators in longitudinal settings. Motivated by an application to pre-exposure prophylaxis (PrEP) initiation studies, we propose a new treatment intervention dependent on the observed treatment process. We show there exist 1) estimators that are doubly and multiply robust to model misspecification, and 2) estimators that when used with machine learning algorithms can attain fast convergence rates for our proposed intervention. Theoretical results are confirmed via simulation studies.
 [2] arXiv:2112.00816 [pdf, other]

Title: Maximum Likelihood Estimation for Brownian Motion Tree Models Based on One Sample
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We study the problem of maximum likelihood estimation given one data sample ($n=1$) over Brownian Motion Tree Models (BMTMs), a class of Gaussian models on trees. BMTMs are often used as a null model in phylogenetics, where the one-sample regime is common. Specifically, we show that, almost surely, the one-sample BMTM maximum likelihood estimator (MLE) exists, is unique, and corresponds to a fully observed tree. Moreover, we provide a polynomial-time algorithm for its exact computation. We also consider the MLE over all possible BMTM tree structures in the one-sample case and show that it exists almost surely, that it coincides with the MLE over diagonally dominant M-matrices, and that it admits a unique closed-form solution that corresponds to a path graph. Finally, we explore statistical properties of the one-sample BMTM MLE through numerical experiments.
 [3] arXiv:2112.00832 [pdf, ps, other]

Title: On the robustness and precision of mixed-model analysis of covariance in cluster-randomized trials
Subjects: Methodology (stat.ME)
In the analyses of cluster-randomized trials, a standard approach for covariate adjustment and handling within-cluster correlations is the mixed-model analysis of covariance (ANCOVA). The mixed-model ANCOVA makes stringent assumptions, including normality, linearity, and a compound symmetric correlation structure, which may be challenging to verify and may not hold in practice. When mixed-model ANCOVA assumptions are violated, the validity and efficiency of the model-based inference for the average treatment effect are currently unclear. In this article, we prove that the mixed-model ANCOVA estimator for the average treatment effect is consistent and asymptotically normal under arbitrary misspecification of its working model. Under equal randomization, we further show that the model-based variance estimator for the mixed-model ANCOVA estimator remains consistent, clarifying that the confidence interval given by standard software is asymptotically valid even under model misspecification. Beyond robustness, we also provide a caveat that covariate adjustment via mixed-model ANCOVA may lead to precision loss compared to no adjustment when the covariance structure is misspecified, and describe when a cluster-level ANCOVA becomes more efficient. These results hold under both simple and stratified randomization, and are further illustrated via simulations as well as analyses of three cluster-randomized trials.
 [4] arXiv:2112.00855 [pdf, ps, other]

Title: Investigating an Alternative for Estimation from a Nonprobability Sample: Matching plus Calibration
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Matching a nonprobability sample to a probability sample is one strategy both for selecting the nonprobability units and for weighting them. This approach has been employed in the past to select subsamples of persons from a large panel of volunteers. One method of weighting, introduced here, is to assign a unit in the nonprobability sample the weight from its matched case in the probability sample. The properties of resulting estimators depend on whether the probability sample weights are inverses of selection probabilities or are calibrated. In addition, imperfect matching can cause estimates from the matched sample to be biased so that its weights need to be adjusted, especially when the size of the volunteer panel is small. Calibration weighting combined with matching is one approach to correcting bias and reducing variances. We explore the theoretical properties of the matched and matched, calibrated estimators with respect to a quasi-randomization distribution that is assumed to describe how units in the nonprobability sample are observed, a superpopulation model for analysis variables collected in the nonprobability sample, and the randomization distribution for the probability sample. Numerical studies using simulated and real data from the 2015 US Behavioral Risk Factor Surveillance Survey are conducted to examine the performance of the alternative estimators.
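As a rough illustration of the weighting idea in this abstract, the sketch below matches each probability-sample unit to its nearest unused nonprobability unit on a single covariate and hands that unit the probability-sample weight. The function names, the one-dimensional absolute-distance matching rule, and the toy numbers are assumptions made for illustration; the paper's actual matching procedure and the calibration step are not reproduced here.

```python
def match_and_weight(prob_units, nonprob_units):
    """Nearest-neighbour matching without replacement (illustrative only).

    prob_units:    list of (covariate, survey_weight) from the probability sample
    nonprob_units: list of (covariate, outcome) from the nonprobability sample
    Assumes there are at least as many nonprobability units as probability units.
    Returns a list of (outcome, inherited_weight) for the matched units.
    """
    available = list(range(len(nonprob_units)))  # indices not yet matched
    matched = []
    for x_p, w_p in prob_units:
        # Closest remaining nonprobability unit on the covariate.
        j = min(available, key=lambda k: abs(nonprob_units[k][0] - x_p))
        available.remove(j)
        # The matched unit inherits the probability-sample weight.
        matched.append((nonprob_units[j][1], w_p))
    return matched


def weighted_mean(pairs):
    """Weighted mean of outcomes given (outcome, weight) pairs."""
    total_weight = sum(w for _, w in pairs)
    return sum(y * w for y, w in pairs) / total_weight


# Hypothetical toy data: (age, weight) and (age, outcome).
prob_sample = [(20, 100.0), (40, 300.0)]
volunteer_panel = [(22, 1.0), (35, 3.0), (60, 10.0)]

matched = match_and_weight(prob_sample, volunteer_panel)
estimate = weighted_mean(matched)
```

With this toy data the unit aged 60 is never matched, which mirrors the role of matching as a selection step as well as a weighting step.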
 [5] arXiv:2112.00871 [pdf, ps, other]

Title: Diffusion Mean Estimation on the Diagonal of Product Manifolds
Subjects: Methodology (stat.ME)
Computing sample means on Riemannian manifolds is typically computationally costly. The Fréchet mean offers a generalization of the Euclidean mean to general metric spaces, particularly to Riemannian manifolds. Evaluating the Fréchet mean numerically on Riemannian manifolds requires the computation of geodesics for each sample point. When closed-form expressions do not exist for geodesics, an optimization-based approach is employed. In geometric deep learning, particularly Riemannian convolutional neural networks, a weighted Fréchet mean enters each layer of the network, potentially requiring an optimization in each layer. The weighted diffusion mean offers an alternative weighted mean sample estimator on Riemannian manifolds that does not require the computation of geodesics. Instead, we present a simulation scheme to sample guided diffusion bridges on a product manifold conditioned to intersect at a predetermined time. Such a conditioning is nontrivial since, in general, manifolds cannot be covered by a single chart. Exploiting the exponential chart, the conditioning can be made similar to that in the Euclidean setting.
 [6] arXiv:2112.01164 [pdf, other]

Title: Sequential Spatially Balanced Sampling
Subjects: Methodology (stat.ME)
Sequential sampling occurs when the entire population is not known in advance and data are obtained one at a time or in groups of units. This manuscript proposes a new algorithm to sequentially select a balanced sample. The algorithm respects equal and unequal inclusion probabilities. The method can also be used to select a spatially balanced sample if the population of interest contains spatial coordinates. A simulation study is proposed on a dataset of Swiss municipalities. The results show that the proposed method outperforms other methods.
 [7] arXiv:2112.01369 [pdf, other]

Title: The Classic Cross-Correlation and the Real-Valued Jaccard and Coincidence Indices
Authors: Luciano da F. Costa
Comments: 9 pages, 8 figures. A preprint
Subjects: Methodology (stat.ME); Information Theory (cs.IT)
In this work we describe and compare the classic inner product and Pearson correlation coefficient as well as the recently introduced real-valued Jaccard and coincidence indices. Special attention is given to diverse schemes for taking into account the signs of the operands, as well as to the study of the geometry of the scalar field surface related to the generalized multiset binary operations underlying the considered similarity indices. The possibility to split the classic inner product, cross-correlation, and Pearson correlation coefficient is also described.
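To make the relation between the two classic measures named in this abstract concrete, the sketch below computes the Pearson correlation coefficient as an inner product of mean-centered vectors, normalized by their norms. This is the textbook identity only; the real-valued Jaccard and coincidence indices are not shown, since their definitions are not given in the abstract. The function names are chosen for illustration.

```python
import math


def inner(x, y):
    """Classic inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation as a normalized inner product of centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cx = [a - mean_x for a in x]  # centered copy of x
    cy = [b - mean_y for b in y]  # centered copy of y
    return inner(cx, cy) / math.sqrt(inner(cx, cx) * inner(cy, cy))


x = [1.0, 2.0, 3.0]
same_direction = pearson(x, [2.0, 4.0, 6.0])
opposite_direction = pearson(x, [3.0, 2.0, 1.0])
```

The inner product depends on the scale and offset of the operands, whereas the Pearson coefficient removes both, which is why perfectly linear pairs give values of magnitude one.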
 [8] arXiv:2112.01372 [pdf, other]

Title: Hierarchical clustering: visualization, feature importance and model selection
Comments: 18 pages, 7 figures
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
We propose methods for the analysis of hierarchical clustering that fully use the multiresolution structure provided by a dendrogram. Specifically, we propose a loss for choosing between clustering methods, a feature importance score and a graphical tool for visualizing the segmentation of features in a dendrogram. Current approaches to these tasks lead to loss of information since they require the user to generate a single partition of the instances by cutting the dendrogram at a specified level. Our proposed methods, instead, use the full structure of the dendrogram. The key insight behind the proposed methods is to view a dendrogram as a phylogeny. This analogy permits the assignment of a feature value to each internal node of a tree through ancestral state reconstruction. Real and simulated datasets provide evidence that our proposed framework has desirable outcomes. We provide an R package that implements our methods.
 [9] arXiv:2112.01374 [pdf]

Title: On the optimization of hyperparameters in Gaussian process regression
Comments: 14 pages, 2 figures, 2 tables
Subjects: Methodology (stat.ME); Numerical Analysis (math.NA)
When the data are sparse, optimization of hyperparameters of the kernel in Gaussian process regression by the commonly used maximum likelihood estimation (MLE) criterion often leads to overfitting. We show that choosing hyperparameters based on a criterion of the completeness of the basis in the corresponding linear regression problem is superior to MLE. We show that this is facilitated by the use of high-dimensional model representation (HDMR), whereby a low-order HDMR representation can provide reliable reference functions and large synthetic test data sets needed for basis parameter optimization even with few data.
 [10] arXiv:2112.01380 [pdf, other]

Title: Prior knowledge elicitation: The past, present, and future
Authors: Petrus Mikkola, Osvaldo A. Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, Arto Klami
Comments: 60 pages, 1 figure
Subjects: Methodology (stat.ME)
Specification of the prior distribution for a Bayesian model is a central part of the Bayesian workflow for data analysis, but it is often difficult even for statistical experts. Prior elicitation transforms domain knowledge of various kinds into well-defined prior distributions, and offers a solution to the prior specification problem, in principle. In practice, however, we are still fairly far from having usable prior elicitation tools that could significantly influence the way we build probabilistic models in academia and industry. We lack elicitation methods that integrate well into the Bayesian workflow and perform elicitation efficiently in terms of costs of time and effort. We even lack a comprehensive theoretical framework for understanding different facets of the prior elicitation problem.
Why are we not widely using prior elicitation? We analyze the state of the art by identifying a range of key aspects of prior knowledge elicitation, from properties of the modelling task and the nature of the priors to the form of interaction with the expert. The existing prior elicitation literature is reviewed and categorized in these terms. This allows recognizing understudied directions in prior elicitation research, finally leading to a proposal of several new avenues to improve prior elicitation methodology.
Cross-lists for Fri, 3 Dec 21
 [11] arXiv:2112.00827 (cross-list from cs.CL) [pdf, other]

Title: Changepoint Analysis of Topic Proportions in Temporal Text Data
Comments: 32 pages, 9 figures
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Changepoint analysis deals with unsupervised detection and/or estimation of time-points in time-series data, when the distribution generating the data changes. In this article, we consider offline changepoint detection in the context of large-scale textual data. We build a specialised temporal topic model with provisions for changepoints in the distribution of topic proportions. As full likelihood-based inference in this model is computationally intractable, we develop a computationally tractable approximate inference procedure. More specifically, we use sample splitting to estimate topic polytopes first and then apply a likelihood ratio statistic together with a modified version of the wild binary segmentation algorithm of Fryzlewicz et al. (2014). Our methodology facilitates automated detection of structural changes in large corpora without the need of manual processing by domain experts. As changepoints under our model correspond to changes in topic structure, the estimated changepoints are often highly interpretable as marking the surge or decline in popularity of a fashionable topic. We apply our procedure on two large datasets: (i) a corpus of English literature from the period 1800-1922 (Underwood et al., 2015); (ii) abstracts from the High Energy Physics arXiv repository (Clement et al., 2019). We obtain some historically well-known changepoints and discover some new ones.
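For readers unfamiliar with offline changepoint detection, the sketch below locates a single change in the mean of a univariate sequence using a CUSUM-type contrast. This is a deliberately simplified stand-in, not the paper's procedure: it does not involve a topic model, a likelihood ratio over topic proportions, or wild binary segmentation. The function name and the toy series are assumptions for illustration.

```python
import math


def best_single_split(series):
    """Return (split_index, statistic) maximising a CUSUM-type contrast.

    A split at index t compares series[:t] with series[t:]; the contrast is
    sqrt(t * (n - t) / n) * |mean(left) - mean(right)|, which is large when
    the two segments have clearly different means.
    """
    n = len(series)
    best_t, best_stat = None, -1.0
    for t in range(1, n):  # both segments must be non-empty
        left, right = series[:t], series[t:]
        diff = abs(sum(left) / t - sum(right) / (n - t))
        stat = math.sqrt(t * (n - t) / n) * diff
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat


# Hypothetical series whose mean shifts after the fourth observation.
toy_series = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0]
split, statistic = best_single_split(toy_series)
```

Binary segmentation schemes apply a search of this kind recursively to the segments on either side of each detected split; the wild variant cited in the abstract evaluates the contrast on randomly drawn sub-intervals.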
 [12] arXiv:2112.00866 (cross-list from stat.CO) [pdf, other]

Title: Bridge Simulation on Lie Groups and Homogeneous Spaces with Application to Parameter Estimation
Comments: arXiv admin note: text overlap with arXiv:2106.03431
Subjects: Computation (stat.CO); Probability (math.PR); Methodology (stat.ME)
We present three simulation schemes for simulating Brownian bridges on complete and connected Lie groups and homogeneous spaces and use numerical results of the guided processes in the Lie group $\mathrm{SO}(3)$ and on the homogeneous spaces $\mathrm{SPD}(3) = \mathrm{GL}_+(3)/\mathrm{SO}(3)$ and $\mathbb S^2 = \mathrm{SO}(3)/\mathrm{SO}(2)$ to evaluate our sampling scheme. Brownian motions on Lie groups can be defined via the Laplace-Beltrami operator of a left- (or right-)invariant Riemannian metric. Given i.i.d. Lie group-valued samples on $\mathrm{SO}(3)$ drawn from a Brownian motion with unknown Riemannian metric structure, the underlying Riemannian metric on $\mathrm{SO}(3)$ is estimated using an iterative maximum likelihood (MLE) method. Furthermore, the resampling technique is applied to yield estimates of the heat kernel on the two-sphere considered as a homogeneous space. Comparing this estimate to the truncated version of the closed-form expression for the heat kernel on $\mathbb S^2$ serves as a proof of concept for the validity of the sampling scheme on homogeneous spaces.
 [13] arXiv:2112.01063 (cross-list from cs.CV) [pdf, other]

Title: Fast automatic deforestation detectors and their extensions for other spatial objects
Subjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Methodology (stat.ME)
This paper is devoted to the problem of detection of forest and non-forest areas on Earth images. We propose two statistical methods to tackle this problem: one based on multiple hypothesis testing with parametric distribution families, the other on nonparametric tests. The parametric approach is novel in the literature and relevant to a larger class of problems: detection of natural objects, as well as anomaly detection. We develop mathematical background for each of the two methods, build self-sufficient detection algorithms using them and discuss numerical aspects of their implementation. We also compare our algorithms with those from standard machine learning using satellite data.
Replacements for Fri, 3 Dec 21
 [14] arXiv:2006.00077 (replaced) [pdf, other]

Title: CLARITY - Comparing heterogeneous data using dissimiLARITY
Authors: Daniel J. Lawson, Vinesh Solanki, Igor Yanovich, Johannes Dellert, Damian Ruck, Phillip Endicott
Comments: R package available from this https URL. 30 pages, 8 figures
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
 [15] arXiv:2006.05371 (replaced) [pdf, other]

Title: Bayesian Probabilistic Numerical Integration with Tree-Based Models
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
 [16] arXiv:2102.08931 (replaced) [pdf]

Title: Overcoming bias in representational similarity analysis
Authors: Roberto Viviani
Comments: 17 pages, 6 figures
Subjects: Methodology (stat.ME); Quantitative Methods (q-bio.QM)
 [17] arXiv:2107.04873 (replaced) [pdf, ps, other]

Title: The EAS approach to variable selection for multivariate response data in high-dimensional settings
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
 [18] arXiv:2109.05121 (replaced) [pdf, ps, other]

Title: Diagnostics for Monte Carlo Algorithms for Models with Intractable Normalizing Functions
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO)
 [19] arXiv:2111.14966 (replaced) [pdf, other]

Title: Confidence regions for univariate and multivariate data using permutation tests
Authors: Niels Lundtorp Olsen
Comments: Updated with author affiliation and keywords
Subjects: Methodology (stat.ME)
 [20] arXiv:2010.01396 (replaced) [pdf, other]

Title: Regularized Bayesian calibration and scoring of the WDFAB IRT model improves predictive performance over marginal maximum likelihood
Comments: Revision in review PLOS one
Subjects: Applications (stat.AP); Methodology (stat.ME)
 [21] arXiv:2108.00866 (replaced) [pdf, other]

Title: Nonparametric posterior learning for emission tomography with multimodal data
Subjects: Machine Learning (stat.ML); Mathematical Physics (math-ph); Applications (stat.AP); Methodology (stat.ME)
 [22] arXiv:2111.10628 (replaced) [pdf]

Title: Localized Mutual Information Monitoring of Pairwise Associations in Animal Movement
Authors: Andrew B. Whetten
Subjects: Quantitative Methods (q-bio.QM); Populations and Evolution (q-bio.PE); Methodology (stat.ME)