Statistics
New submissions
New submissions for Fri, 21 Jan 22
 [1] arXiv:2201.07796 [pdf, other]

Title: The R package $\texttt{ebmstate}$ for disease progression analysis under empirical Bayes Cox models
Subjects: Computation (stat.CO); Methodology (stat.ME)
The software package $\texttt{mstate}$, in articulation with the package $\texttt{survival}$, provides not only a well-established multistate survival analysis framework in R, but also one of the most complete, as it includes point and interval estimation of relative transition hazards, cumulative transition hazards and state occupation probabilities, both under clock-forward and clock-reset models; personalised estimates, i.e. estimates for an individual with specific covariate measurements, can also be obtained with $\texttt{mstate}$ by fitting a Cox regression model. The new R package $\texttt{ebmstate}$, which we present in the current paper, is an extension of $\texttt{mstate}$ and, to our knowledge, the first R package for multistate model estimation that is suitable for higher-dimensional data and complete in the sense just mentioned. Its extension of $\texttt{mstate}$ is threefold: it transforms the Cox model into a regularised, empirical Bayes model that performs significantly better with higher-dimensional data; it replaces asymptotic confidence intervals meant for the low-dimensional setting by non-parametric bootstrap confidence intervals; and it introduces an analytical, Fourier transform-based estimator of state occupation probabilities for clock-reset models that is substantially faster than the corresponding, simulation-based estimator in $\texttt{mstate}$. The present paper includes a detailed tutorial on how to use our package to estimate transition hazards and state occupation probabilities, as well as a simulation study showing how it improves the performance of $\texttt{mstate}$.
 [2] arXiv:2201.07830 [pdf, other]

Title: Using Joint Random Partition Models for Flexible Change Point Analysis in Multivariate Processes
Comments: 25 pages, 6 figures, 1 table
Subjects: Methodology (stat.ME)
Change point analyses are concerned with identifying positions of an ordered stochastic process that undergo abrupt local changes of some underlying distribution. When multiple processes are observed, it is often the case that information regarding the change point positions is shared across the different processes. This work describes a method that takes advantage of this type of information. Since the number and position of change points can be described through a partition with contiguous clusters, our approach develops a joint model for these types of partitions. We describe computational strategies associated with our approach and illustrate improved performance in detecting change points through a small simulation study. We then apply our method to a financial data set of emerging markets in Latin America and highlight interesting insights discovered due to the correlation between change point locations among these economies.
 [3] arXiv:2201.07874 [pdf, ps, other]

Title: Bayesian Prediction with Covariates Subject to Detection Limits
Subjects: Methodology (stat.ME); Applications (stat.AP)
Missing values in covariates due to censoring by signal interference or lack of sensitivity in the measuring devices are common in industrial problems. We propose a full Bayesian solution to the prediction problem with an efficient Markov Chain Monte Carlo (MCMC) algorithm that updates all the censored covariate values jointly in a random scan Gibbs sampler. We show that the joint updating of missing covariate values can be at least two orders of magnitude more efficient than univariate updating. This increased efficiency is shown to be crucial for quickly learning the missing covariate values and their uncertainty in a real-time decision making context, in particular when there is substantial correlation in the posterior for the missing values. The approach is evaluated on simulated data and on data from the telecom sector. Our results show that the proposed Bayesian imputation gives substantially more accurate predictions than na\"ive imputation, and that the use of auxiliary variables in the imputation gives additional predictive power.
 [4] arXiv:2201.07896 [pdf, other]

Title: Generative Models for Periodicity Detection in Noisy Signals
Subjects: Methodology (stat.ME); Applications (stat.AP)
We introduce a new periodicity detection algorithm for binary time series of event onsets, the Gaussian Mixture Periodicity Detection Algorithm (GMPDA). The algorithm approaches the periodicity detection problem by inferring the parameters of a generative model. We specified two models, the Clock and Random Walk, which describe two different periodic phenomena and provide a generative framework. The algorithm achieved strong results on test cases for single and multiple periodicity detection and varying noise levels. The performance of GMPDA was also evaluated on real data, recorded leg movements during sleep, where GMPDA was able to identify the expected periodicities despite high noise levels. The paper's key contributions are two new models for generating periodic event behavior and the GMPDA algorithm for multiple periodicity detection, which is highly accurate under noise.
 [5] arXiv:2201.07910 [pdf, other]

Title: A Complex-LASSO Approach for Localizing Forced Oscillations in Power Systems
Comments: 5 pages, submitted to IEEE PESGM 2022
Subjects: Applications (stat.AP)
We study the problem of localizing multiple sources of forced oscillations (FOs) and estimating their characteristics, such as frequency, phase, and amplitude, using noisy PMU measurements. For each source location, we model the input oscillation as a sum of unknown sinusoidal terms. This allows us to obtain a linear relationship between measurements and the inputs at the unknown sinusoids' frequencies in the frequency domain. We determine these frequencies by thresholding the empirical spectrum of the noisy measurements. Assuming sparsity in the number of FOs' locations and the number of sinusoids at each location, we cast the location recovery problem as an $\ell_1$-regularized least squares problem in the complex domain, i.e., complex-LASSO (least absolute shrinkage and selection operator). We numerically solve this optimization problem using the complex-valued coordinate descent method, and show its efficiency on the IEEE 68-bus, 16-machine and WECC 179-bus, 29-machine systems.
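As a rough illustration of what an $\ell_1$-regularized least squares problem in the complex domain involves, the sketch below runs a generic cyclic coordinate descent on the objective $\tfrac{1}{2}\|y - Ax\|_2^2 + \lambda \sum_j |x_j|$ with complex entries. It is not taken from the paper: the function names `complex_soft_threshold` and `complex_lasso_cd`, the objective scaling, and the fixed iteration count are all assumptions made for this example. The key step is the complex soft-threshold, which shrinks the modulus of a coefficient while keeping its phase.

```python
def complex_soft_threshold(z: complex, lam: float) -> complex:
    """Shrink the modulus of z by lam, keeping its phase (hypothetical helper)."""
    r = abs(z)
    if r <= lam:
        return 0j
    return z * (1.0 - lam / r)


def complex_lasso_cd(A, y, lam, n_iter=200):
    """Cyclic coordinate descent for 0.5*||y - A x||^2 + lam * sum_j |x_j|.

    A is a list of rows of complex numbers, y a list of complex numbers.
    This is a plain, unoptimized sketch for small problems.
    """
    m, n = len(A), len(A[0])
    x = [0j] * n
    col_norm2 = [sum(abs(A[i][j]) ** 2 for i in range(m)) for j in range(n)]
    for _ in range(n_iter):
        for j in range(n):
            if col_norm2[j] == 0.0:
                continue  # an all-zero column cannot explain any signal
            # Correlate column j with the residual that excludes column j.
            rho = 0j
            for i in range(m):
                partial = y[i] - sum(A[i][k] * x[k] for k in range(n) if k != j)
                rho += A[i][j].conjugate() * partial
            x[j] = complex_soft_threshold(rho, lam) / col_norm2[j]
    return x
```

With an identity design matrix the solution is just the soft-thresholded observation, which makes the shrinkage easy to inspect: a coefficient whose modulus is below `lam` is set exactly to zero, which is how sparsity in the recovered locations would arise.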
 [6] arXiv:2201.07945 [pdf, other]

Title: A Guideline for the Statistical Analysis of Compositional Data in Immunology
Subjects: Applications (stat.AP)
The study of immune cellular composition is of great scientific interest in immunology, and multiple large-scale data sets have been generated recently to support this investigation. From the statistical point of view, such immune cellular composition data correspond to compositional data that convey relative information. In compositional data, each element is positive and all the elements together sum to a constant, which can be set to one in general. Standard statistical methods are not directly applicable for the analysis of compositional data because they do not appropriately handle correlations among elements in the compositional data. As this type of data has become more widely available, the investigation of optimal statistical strategies that account for compositional features in data has become increasingly necessary. In this paper, we review statistical methods for compositional data analysis and illustrate them in the context of immunology. Specifically, we focus on regression analyses using log-ratio and Dirichlet approaches, discuss their theoretical foundations, and illustrate their applications with immune cellular fraction data generated from colorectal cancer patients.
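To make the log-ratio idea concrete, here is a minimal sketch of the centered log-ratio (clr) transform, one common member of the log-ratio family. Whether the paper uses clr, an additive or an isometric log-ratio is not stated in the abstract, so the choice of clr and the function name `clr` are assumptions for illustration only.

```python
import math


def clr(composition):
    """Centered log-ratio transform of a strictly positive composition.

    Each part is replaced by its log minus the mean log of all parts,
    so the output depends only on ratios between parts.
    """
    if any(part <= 0 for part in composition):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]
```

Because only ratios matter, rescaling the composition (for example from fractions to percentages) leaves the transformed values unchanged, and the transformed values always sum to zero.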
 [7] arXiv:2201.07998 [pdf]

Title: Statistical Learning for Individualized Asset Allocation
Subjects: Machine Learning (stat.ML); Statistics Theory (math.ST)
We establish a high-dimensional statistical learning framework for individualized asset allocation. Our proposed methodology addresses continuous-action decision-making with a large number of characteristics. We develop a discretization approach to model the effect from continuous actions and allow the discretization level to be large and diverge with the number of observations. The value function of continuous-action is estimated using penalized regression with generalized penalties that are imposed on linear transformations of the model coefficients. We show that our estimators using generalized folded concave penalties enjoy desirable theoretical properties and allow for statistical inference of the optimal value associated with optimal decision-making. Empirically, the proposed framework is exercised with the Health and Retirement Study data in finding individualized optimal asset allocation. The results show that our individualized optimal strategy improves individual financial well-being and surpasses benchmark strategies.
 [8] arXiv:2201.08003 [pdf, other]

Title: Inference in High-dimensional Multivariate Response Regression with Hidden Variables
Subjects: Methodology (stat.ME)
This paper studies the inference of the regression coefficient matrix under multivariate response linear regressions in the presence of hidden variables. A novel procedure for constructing confidence intervals of entries of the coefficient matrix is proposed. Our method first utilizes the multivariate nature of the responses by estimating and adjusting the hidden effect to construct an initial estimator of the coefficient matrix. By further deploying a low-dimensional projection procedure to reduce the bias introduced by the regularization in the previous step, a refined estimator is proposed and shown to be asymptotically normal. The asymptotic variance of the resulting estimator is derived with closed-form expression and can be consistently estimated. In addition, we propose a testing procedure for the existence of hidden effects and provide its theoretical justification. Both our procedures and their analyses are valid even when the feature dimension and the number of responses exceed the sample size. Our results are further backed up via extensive simulations and a real data analysis.
 [9] arXiv:2201.08012 [pdf, other]

Title: Entropy Balancing for Generalizing Causal Estimation with Summary-level Information
Subjects: Methodology (stat.ME)
In this paper, we focus on estimating the average treatment effect (ATE) of a target population when individual-level data from a source population and summary-level data (e.g., first or second moments of certain covariates) from the target population are available. In the presence of heterogeneous treatment effect, the ATE of the target population can be different from that of the source population when distributions of treatment effect modifiers are dissimilar in these two populations, a phenomenon also known as covariate shift. Many methods have been developed to adjust for covariate shift, but most require individual covariates from the target population. We develop a weighting approach based on summary-level information from the target population to adjust for possible covariate shift in effect modifiers. In particular, weights of the treated and control groups within the source population are calibrated by the summary-level information of the target population. In addition, our approach also seeks additional covariate balance between the treated and control groups in the source population. We study the asymptotic behavior of the corresponding weighted estimator for the target population ATE under a wide range of conditions. The theoretical implications are confirmed in simulation studies and a real data application.
 [10] arXiv:2201.08044 [pdf, ps, other]

Title: Metropolis Augmented Hamiltonian Monte Carlo
Authors: Guangyao Zhou
Comments: Symposium on Advances in Approximate Bayesian Inference (AABI) 2022
Subjects: Computation (stat.CO)
Hamiltonian Monte Carlo (HMC) is a powerful Markov Chain Monte Carlo (MCMC) method for sampling from complex high-dimensional continuous distributions. However, in many situations it is necessary or desirable to combine HMC with other Metropolis-Hastings (MH) samplers. The common HMC-within-Gibbs strategy implies a tradeoff between long HMC trajectories and more frequent other MH updates. Addressing this tradeoff has been the focus of several recent works. In this paper we propose Metropolis Augmented Hamiltonian Monte Carlo (MAHMC), an HMC variant that allows MH updates within HMC and eliminates this tradeoff. Experiments on two representative examples demonstrate MAHMC's efficiency and ease of use when compared with within-Gibbs alternatives.
 [11] arXiv:2201.08053 [pdf, ps, other]

Title: Bayesian Fused Lasso Modeling via Horseshoe Prior
Comments: 17 pages
Subjects: Methodology (stat.ME)
Bayesian fused lasso is one of the sparse Bayesian methods, which shrinks both regression coefficients and their successive differences simultaneously. In this paper, we propose a Bayesian fused lasso modeling via horseshoe prior. By assuming a horseshoe prior on the difference of successive regression coefficients, the proposed method enables us to prevent over-shrinkage of those differences. We also propose a Bayesian hexagonal operator for regression with shrinkage and equality selection (HORSES) with horseshoe prior, which imposes priors on all combinations of differences of regression coefficients. Simulation studies and an application to real data show that the proposed method gives better performance than existing methods.
 [12] arXiv:2201.08057 [pdf, other]

Title: Non-nested model selection based on empirical likelihood
Comments: 31 pages for main body and 15 pages for supplementary material, 4 tables
Subjects: Methodology (stat.ME)
We propose an empirical likelihood ratio test for nonparametric model selection, where the competing models may be nested, non-nested, overlapping, misspecified, or correctly specified. It compares the squared prediction errors of models based on the cross-validation and allows for heteroscedasticity of the errors of models. We develop its asymptotic distributions for comparing additive models and varying-coefficient models and extend it to test significance of variables in additive models with massive data. The method is applicable to model selection among supervised learning models. To facilitate implementation of the test, we provide a fast calculation procedure. Simulations show that the proposed tests work well and have favorable finite sample performance over some existing approaches. The methodology is validated on an empirical application.
 [13] arXiv:2201.08072 [pdf, other]

Title: Geometrically adapted Langevin dynamics for Markov chain Monte Carlo simulations
Comments: 43 pages, 9 figures
Subjects: Applications (stat.AP); Computation (stat.CO)
Markov Chain Monte Carlo (MCMC) is one of the most powerful methods to sample from a given probability distribution, of which the Metropolis Adjusted Langevin Algorithm (MALA) is a variant wherein the gradient of the distribution is used towards faster convergence. However, being set up in the Euclidean framework, MALA might perform poorly in higher dimensional problems or in those involving anisotropic densities as the underlying non-Euclidean aspects of the geometry of the sample space remain unaccounted for. We make use of concepts from differential geometry and stochastic calculus on Riemannian manifolds to geometrically adapt a stochastic differential equation with a non-trivial drift term. This adaptation is also referred to as a stochastic development. We apply this method specifically to the Langevin diffusion equation and arrive at a geometrically adapted Langevin dynamics. This new approach far outperforms MALA, certain manifold variants of MALA, and other approaches such as Hamiltonian Monte Carlo (HMC) and its adaptive variant, the no-U-turn sampler (NUTS) implemented in Stan, especially as the dimension of the problem increases where often GALA is actually the only successful method. This is evidenced through several numerical examples that include parameter estimation of a broad class of probability distributions and a logistic regression problem.
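For readers unfamiliar with the Euclidean baseline mentioned above, the following is a minimal one-dimensional sketch of a single MALA step: a Langevin proposal that drifts along the gradient of the log-density, followed by a Metropolis-Hastings correction. It illustrates plain MALA only, not the geometrically adapted method of the paper. The function name `mala_step`, the step-size parameterization, and the standard normal target used in the helpers are assumptions chosen for this example.

```python
import math
import random


def standard_normal_log_density(x):
    """Log-density of N(0, 1) up to an additive constant (example target)."""
    return -0.5 * x * x


def standard_normal_grad(x):
    """Gradient of the example log-density."""
    return -x


def mala_step(x, log_density, grad_log_density, step, rng):
    """One MALA update in one dimension; returns (new_state, accepted)."""

    def proposal_mean(v):
        # Euler step of the Langevin diffusion with step size `step`.
        return v + 0.5 * step * grad_log_density(v)

    y = rng.gauss(proposal_mean(x), math.sqrt(step))
    # Log proposal densities (shared normalizing constants cancel).
    log_q_forward = -((y - proposal_mean(x)) ** 2) / (2.0 * step)
    log_q_backward = -((x - proposal_mean(y)) ** 2) / (2.0 * step)
    log_alpha = log_density(y) - log_density(x) + log_q_backward - log_q_forward
    # Metropolis-Hastings correction; min(0, .) avoids overflow in exp.
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return y, True
    return x, False
```

The proposal covariance here is the identity scaled by `step`, which is exactly the Euclidean assumption the abstract criticizes: for strongly anisotropic targets a single scalar step size cannot suit all directions at once.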
 [14] arXiv:2201.08082 [pdf, other]

Title: Kernel Methods and Multi-layer Perceptrons Learn Linear Models in High Dimensions
Authors: Mojtaba Sahraee-Ardakan, Melikasadat Emami, Parthe Pandit, Sundeep Rangan, Alyson K. Fletcher
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Empirical observation of high dimensional phenomena, such as the double descent behaviour, has attracted a lot of interest in understanding classical techniques such as kernel methods, and their implications to explain generalization properties of neural networks. Many recent works analyze such models in a certain high-dimensional regime where the covariates are independent and the number of samples and the number of covariates grow at a fixed ratio (i.e. proportional asymptotics). In this work we show that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime. More surprisingly, when the data is generated by a kernel model where the relationship between input and the response could be very nonlinear, we show that linear models are in fact optimal, i.e. linear models achieve the minimum risk among all models, linear or nonlinear. These results suggest that more complex models for the data other than independent features are needed for high-dimensional analysis.
 [15] arXiv:2201.08153 [pdf, other]

Title: Bayesian Nonparametric Mixtures of Exponential Random Graph Models for Ensembles of Networks
Subjects: Methodology (stat.ME); Computation (stat.CO)
Ensembles of networks arise in various fields where multiple independent networks are observed on the same set of nodes, for example, a collection of brain networks constructed on the same brain regions for different individuals. However, there are few models that describe both the variations and characteristics of networks in an ensemble at the same time. In this paper, we propose to model the ensemble of networks using a Dirichlet Process Mixture of Exponential Random Graph Models (DPM-ERGMs), which divides the ensemble into different clusters and models each cluster of networks using a separate Exponential Random Graph Model (ERGM). By employing a Dirichlet process mixture, the number of clusters can be determined automatically and changed adaptively with the data provided. Moreover, in order to perform full Bayesian inference for DPM-ERGMs, we employ the intermediate importance sampling technique inside the Metropolis-within-slice sampling scheme, which addresses the problem of sampling from the intractable ERGMs on an infinite sample space. We also demonstrate the performance of DPM-ERGMs with both simulated and real datasets.
 [16] arXiv:2201.08171 [pdf, other]

Title: Use of Simulation Models for the Development of a Statistical Production Framework for Mobile Network Data with the simutils Package
Comments: 17 pages, 11 figures, presented at the Conference Use of R in Official Statistics 2021, 24-26 November 2021, Bucharest (Romania)
Subjects: Applications (stat.AP); Methodology (stat.ME)
We propose to use agent-based simulation models for the development of statistical methods in Official Statistics, especially in relation to the new digital data sources. We present a mobile network data simulator which is managed through the simutils R package, which provides geospatial representations of the simulated data. While the synthetic data are produced by an external tool, our simutils package allows an R user to parameterize and run this external simulation tool, to build geospatial data structures from the simulation output or to compute several aggregates. The geospatial data structures were designed with the purpose of using them in a visualization package too. Useful simulation models require the incorporation of real metadata from mobile telecommunication networks, which led us to include functionalities allowing the user to specify and validate them. All metadata are specified using XML files whose structure is defined in corresponding XSD files. Our R package includes example data sets, and we show here how to validate the metadata, how to run a simulation, how to build the geospatial data structures and how to compute different aggregates.
 [17] arXiv:2201.08180 [pdf, other]

Title: Sequential Bayesian Inference for Uncertain Nonlinear Dynamic Systems: A Tutorial
Subjects: Methodology (stat.ME); Systems and Control (eess.SY)
In this article, an overview of Bayesian methods for sequential simulation from posterior distributions of nonlinear and non-Gaussian dynamic systems is presented. The focus is mainly laid on sequential Monte Carlo methods, which are based on particle representations of probability densities and can be seamlessly generalized to any state-space representation. Within this context, a unified framework of the various Particle Filter (PF) alternatives is presented for the solution of state, state-parameter and input-state-parameter estimation problems on the basis of sparse measurements. The algorithmic steps of each filter are thoroughly presented and a simple illustrative example is utilized for the inference of i) unobserved states, ii) unknown system parameters and iii) unmeasured driving inputs.
 [18] arXiv:2201.08226 [pdf, other]

Title: Sketch-and-Lift: Scalable Subsampled Semidefinite Program for $K$-means Clustering
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed $K$-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the $K$-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original $K$-means SDP with substantially reduced runtime.
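The "lift" half of this idea can be sketched without any SDP solver. Assuming cluster labels for the subsample have already been obtained somehow (in the paper, from the SDP; here they are simply passed in), centroids are computed on the subsample and every point in the full dataset is assigned to its nearest centroid. The function names `centroids_from_labels` and `lift_labels` are hypothetical, and the sketch assumes every cluster has at least one subsampled point.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroids_from_labels(points, labels, k):
    """Mean of the subsampled points in each of k clusters.

    Assumes each label in range(k) occurs at least once.
    """
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point, label in zip(points, labels):
        counts[label] += 1
        for j in range(dim):
            sums[label][j] += point[j]
    return [[total / counts[c] for total in sums[c]] for c in range(k)]


def lift_labels(all_points, centroids):
    """Nearest-centroid rounding: label every point by its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in all_points
    ]
```

The lift step costs one distance evaluation per point per centroid, so it is linear in the size of the full dataset; the expensive solve is confined to the subsample.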
 [19] arXiv:2201.08283 [pdf, other]

Title: Lead-lag detection and network clustering for multivariate time series with an application to the US equity market
Comments: 29 pages, 28 figures; preliminary version appeared at KDD 2021 - 7th SIGKDD Workshop on Mining and Learning from Time Series (MiLeTS)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Methodology (stat.ME)
In multivariate time series systems, it has been observed that certain groups of variables partially lead the evolution of the system, while other variables follow this evolution with a time delay; the result is a lead-lag structure amongst the time series variables. In this paper, we propose a method for the detection of lead-lag clusters of time series in multivariate systems. We demonstrate that the web of pairwise lead-lag relationships between time series can be helpfully construed as a directed network, for which there exist suitable algorithms for the detection of pairs of lead-lag clusters with high pairwise imbalance. Within our framework, we consider a number of choices for the pairwise lead-lag metric and directed network clustering components. Our framework is validated on both a synthetic generative model for multivariate lead-lag time series systems and daily real-world US equity prices data. We showcase that our method is able to detect statistically significant lead-lag clusters in the US equity market. We study the nature of these clusters in the context of the empirical finance literature on lead-lag relations and demonstrate how these can be used for the construction of predictive financial signals.
 [20] arXiv:2201.08302 [pdf, other]

Title: The R Package HCV for Hierarchical Clustering from Vertex-links
Comments: 12 pages, 7 figures
Subjects: Computation (stat.CO); Applications (stat.AP)
The HCV package implements hierarchical clustering for spatial data. It requires clustering results to be not only homogeneous in non-geographical features among samples but also geographically close to each other within a cluster. We modified typically used hierarchical agglomerative clustering algorithms to introduce the spatial homogeneity, by considering geographical locations as vertices and converting spatial adjacency into whether a shared edge exists between a pair of vertices. The main function HCV, obeying constraints of the vertex links, automatically enforces the spatial contiguity property at each step of iterations. In addition, two methods to find an appropriate number of clusters and to report cluster members are also provided.
 [21] arXiv:2201.08311 [pdf, other]

Title: Accelerated Gradient Flow: Risk, Stability, and Implicit Regularization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Acceleration and momentum are the de facto standard in modern applications of machine learning and optimization, yet the bulk of the work on implicit regularization focuses instead on unaccelerated methods. In this paper, we study the statistical risk of the iterates generated by Nesterov's accelerated gradient method and Polyak's heavy ball method, when applied to least squares regression, drawing several connections to explicit penalization. We carry out our analyses in continuous-time, allowing us to make sharper statements than in prior work, and revealing complex interactions between early stopping, stability, and the curvature of the loss function.
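As a concrete picture of the iterates being studied, the sketch below runs a discrete-time version of Polyak's heavy ball method on a least squares loss $\tfrac{1}{2n}\|Xw - y\|_2^2$. The paper works in continuous time, so this discrete recursion, the function name `heavy_ball_least_squares`, and the default step size and momentum values are assumptions made purely for illustration.

```python
def heavy_ball_least_squares(X, y, lr=0.5, beta=0.5, n_iter=500):
    """Heavy ball iterations for the loss (1/(2n)) * ||X w - y||^2.

    Update: w_next = w - lr * grad(w) + beta * (w - w_prev).
    X is a list of rows, y a list of targets; returns the final iterate.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    w_prev = list(w)
    for _ in range(n_iter):
        residual = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(d)]
        w_next = [w[j] - lr * grad[j] + beta * (w[j] - w_prev[j]) for j in range(d)]
        w_prev, w = w, w_next
    return w
```

Stopping the loop early (a small `n_iter`) returns an iterate that has not yet fit the data fully, which is the kind of early-stopped estimator whose risk the abstract relates to explicit penalization.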
 [22] arXiv:2201.08315 [pdf, other]

Title: Predictive Inference with Weak Supervision
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The expense of acquiring labels in large-scale statistical machine learning makes partially and weakly-labeled data attractive, though it is not always apparent how to leverage such data for model fitting or validation. We present a methodology to bridge the gap between partial supervision and validation, developing a conformal prediction framework to provide valid predictive confidence sets (sets that cover a true label with a prescribed probability, independent of the underlying distribution) using weakly labeled data. To do so, we introduce a (necessary) new notion of coverage and predictive validity, then develop several application scenarios, providing efficient algorithms for classification and several large-scale structured prediction problems. We corroborate the hypothesis that the new coverage definition allows for tighter and more informative (but valid) confidence sets through several experiments.
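For orientation, here is a minimal sketch of standard split conformal prediction for classification with fully observed labels. It is the conventional baseline, not the weakly supervised procedure the paper develops, and the function names `conformal_threshold` and `prediction_set` as well as the score `1 - p(label)` are assumptions chosen for this example.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Finite-sample quantile of the calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested level.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


def prediction_set(class_probabilities, threshold):
    """All labels whose nonconformity score 1 - p(label) is within the threshold."""
    return {
        label
        for label, prob in class_probabilities.items()
        if 1.0 - prob <= threshold
    }
```

Each calibration score would be `1 - p(true label)` under a fitted model. Computing it needs the true label of every calibration point, which is exactly what is missing when labels are only weakly or partially observed.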
 [23] arXiv:2201.08323 [pdf, other]

Title: Parallel and distributed Bayesian modelling for analysing high-dimensional spatio-temporal count data
Subjects: Methodology (stat.ME); Computation (stat.CO)
This paper proposes a general procedure to analyse high-dimensional spatio-temporal count data, with special emphasis on relative risks estimation in cancer epidemiology. Model fitting is carried out using integrated nested Laplace approximations over a partition of the spatio-temporal domain. This is a simple idea that works very well in this context as the models are defined to borrow strength locally in space and time, providing reliable risk estimates. Parallel and distributed strategies are proposed to speed up computations in a setting where Bayesian model fitting is generally prohibitively time-consuming and even unfeasible. We evaluate the whole procedure in a simulation study with a twofold objective: to estimate risks accurately and to detect extreme risk areas while avoiding false positives/negatives. We show that our method outperforms classical global models. A real data analysis comparing the global models and the new procedure is also presented.
 [24] arXiv:2201.08326 [pdf, other]

Title: Learning with latent group sparsity via heat flow dynamics on networks
Comments: 36 pages, 3 figures, 3 tables
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Computation (stat.CO); Machine Learning (stat.ML)
Group or cluster structure on explanatory variables in machine learning problems is a very general phenomenon, which has attracted broad interest from practitioners and theoreticians alike. In this work we contribute an approach to learning under such group structure, that does not require prior information on the group identities. Our paradigm is motivated by the Laplacian geometry of an underlying network with a related community structure, and proceeds by directly incorporating this into a penalty that is effectively computed via a heat flow-based local network dynamics. In fact, we demonstrate a procedure to construct such a network based on the available data. Notably, we dispense with computationally intensive pre-processing involving clustering of variables, spectral or otherwise. Our technique is underpinned by rigorous theorems that guarantee its effective performance and provide bounds on its sample complexity. In particular, in a wide range of settings, it provably suffices to run the heat flow dynamics for time that is only logarithmic in the problem dimensions. We explore in detail the interfaces of our approach with key statistical physics models in network science, such as the Gaussian Free Field and the Stochastic Block Model. We validate our approach by successful applications to real-world data from a wide array of application domains, including computer science, genetics, climatology and economics. Our work raises the possibility of applying similar diffusion-based techniques to classical learning tasks, exploiting the interplay between geometric, dynamical and stochastic structures underlying the data.
 [25] arXiv:2201.08343 [pdf, other]

Title: Using Machine Learning to Test Causal Hypotheses in Conjoint Analysis
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Conjoint analysis is a popular experimental design used to measure multidimensional preferences. Researchers examine how varying a factor of interest, while controlling for other relevant factors, influences decision-making. Currently, there exist two methodological approaches to analyzing data from a conjoint experiment. The first focuses on estimating the average marginal effects of each factor while averaging over the other factors. Although this allows for straightforward design-based estimation, the results critically depend on the distribution of other factors and how interaction effects are aggregated. An alternative model-based approach can compute various quantities of interest, but requires researchers to correctly specify the model, a challenging task for conjoint analysis with many factors and possible interactions. In addition, a commonly used logistic regression has poor statistical properties even with a moderate number of factors when incorporating interactions. We propose a new hypothesis testing approach based on the conditional randomization test to answer the most fundamental question of conjoint analysis: Does a factor of interest matter in any way given the other factors? Our methodology is solely based on the randomization of factors, and hence is free from assumptions. Yet, it allows researchers to use any test statistic, including those based on complex machine learning algorithms. As a result, we are able to combine the strengths of the existing design-based and model-based approaches. We illustrate the proposed methodology through conjoint analysis of immigration preferences and political candidate evaluation. We also extend the proposed approach to test for regularity assumptions commonly used in conjoint analysis.
 [26] arXiv:2201.08349 [pdf, ps, other]

Title: Heavy-tailed Sampling via Transformed Unadjusted Langevin Algorithm
Subjects: Statistics Theory (math.ST); Computation (stat.CO); Machine Learning (stat.ML)
We analyze the oracle complexity of sampling from polynomially decaying heavy-tailed target densities based on running the Unadjusted Langevin Algorithm on certain transformed versions of the target density. The closed-form transformation maps that we construct are shown to be diffeomorphisms, and are particularly suited for developing efficient diffusion-based samplers. We characterize the precise class of heavy-tailed densities for which polynomial-order oracle complexities (in dimension and inverse target accuracy) could be obtained, and provide illustrative examples. We highlight the relationship between our assumptions and functional inequalities (super and weak Poincaré inequalities) based on nonlocal Dirichlet forms defined via fractional Laplacian operators, used to characterize the heavy-tailed equilibrium densities of certain stable-driven stochastic differential equations.
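For readers unfamiliar with the base algorithm, here is a minimal sketch of the plain Unadjusted Langevin Algorithm in one dimension. It does not include the paper's transformation maps; the function name `ula_samples` and the Gaussian example target in the usage note are assumptions chosen only to show the update rule x_{k+1} = x_k + h * grad log pi(x_k) + sqrt(2h) * xi_k.

```python
import math
import random


def ula_samples(grad_log_density, x0, step, n_steps, seed=0):
    """Run 1-D ULA and return the whole trajectory.

    grad_log_density: derivative of the log target density.
    step: step size h; there is no accept/reject correction, so the
    chain targets a slightly biased version of the density.
    """
    rng = random.Random(seed)
    x = x0
    trajectory = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * grad_log_density(x) + math.sqrt(2.0 * step) * noise
        trajectory.append(x)
    return trajectory
```

For a standard normal target one would call `ula_samples(lambda x: -x, x0=0.0, step=0.05, n_steps=20000)`. For a heavy-tailed target, the abstract's approach would instead run this recursion on a transformed density and map the draws back.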
 [27] arXiv:2201.08362 [pdf, ps, other]

Title: Generalised functional additive mixed models with compositional covariates for areal Covid-19 incidence curves
Comments: submitted for publication
Subjects: Applications (stat.AP); Methodology (stat.ME)
We extend the generalised functional additive mixed model to include (functional) compositional covariates carrying relative information of a whole. Relying on the isometric isomorphism of the Bayes Hilbert space of probability densities with a subspace of $L^2$, we include functional compositions as transformed functional covariates with constrained effect function. The extended model allows for the estimation of linear, nonlinear and time-varying effects of scalar and functional covariates, as well as (correlated) functional random effects, in addition to the compositional effects. We use the model to estimate the effect of the age, sex and smoking (functional) composition of the population on regional Covid-19 incidence data for Spain, while accounting for climatological and sociodemographic covariate effects and spatial correlation.
Cross-lists for Fri, 21 Jan 22
 [28] arXiv:2201.07401 (cross-list from math.ST) [pdf, other]

Title: Multiway Spherical Clustering via Degree-Corrected Tensor Block Models
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
We consider the problem of multiway clustering in the presence of unknown degree heterogeneity. Such data problems arise commonly in applications such as recommendation systems, neuroimaging, community detection, and hypergraph partitions in social networks. The allowance of degree heterogeneity provides great flexibility in clustering models, but the extra complexity poses significant challenges in both statistics and computation. Here, we develop a degree-corrected tensor block model with estimation accuracy guarantees. We present the phase transition of clustering performance based on the notion of angle separability, and we characterize three signal-to-noise regimes corresponding to different statistical-computational behaviors. In particular, we demonstrate that an intrinsic statistical-to-computational gap emerges only for tensors of order three or greater. Further, we develop an efficient polynomial-time algorithm that provably achieves exact clustering under mild signal conditions. The efficacy of our procedure is demonstrated through two data applications, one on the human brain connectome project, and another on the Peru Legislation network dataset.
 [29] arXiv:2201.07907 (cross-list from math.OC) [pdf, other]

Title: Localization and Estimation of Unknown Forced Inputs: A Group LASSO Approach
Comments: 12 pages, 5 figures, submitted to IEEE Transactions on Control of Network Systems
Subjects: Optimization and Control (math.OC); Methodology (stat.ME)
We model and study the problem of localizing a set of sparse forcing inputs for linear dynamical systems from noisy measurements when the initial state is unknown. This problem is of particular relevance to detecting forced oscillations in electric power networks. We express measurements as an additive model comprising the initial state and inputs grouped over time, both expanded in terms of the basis functions (i.e., impulse response coefficients). Using this model, with probabilistic guarantees, we recover the locations and simultaneously estimate the initial state and forcing inputs using a variant of the group LASSO (least absolute shrinkage and selection operator) method. Specifically, we provide a tight upper bound on: (i) the probability that the group LASSO estimator wrongly identifies the source locations, and (ii) the $\ell_2$-norm of the estimation error. Our bounds explicitly depend upon the length of the measurement horizon, the noise statistics, the number of inputs and sensors, and the singular values of impulse response matrices. Our theoretical analysis is one of the first to provide a complete treatment for the group LASSO estimator for linear dynamical systems under input-to-output delay assumptions. Finally, we validate our results on synthetic models and the IEEE 68-bus, 16-machine system.
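The group LASSO penalty sums the $\ell_2$-norms of coefficient groups, which is what lets it zero out whole groups (here, whole input locations) at once. A common numerical building block is block soft-thresholding, the proximal operator of that penalty. The sketch below shows only this generic step; the function name `block_soft_threshold` is hypothetical, and the variant analyzed in the paper may differ.

```python
import math


def block_soft_threshold(group, threshold):
    """Proximal operator of threshold * ||group||_2.

    Shrinks the whole coefficient group toward zero and sets it exactly
    to zero when its Euclidean norm is at most the threshold.
    """
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= threshold:
        return [0.0 for _ in group]
    scale = 1.0 - threshold / norm
    return [scale * v for v in group]
```

Applied group by group inside a proximal gradient loop, this step is what produces group-level sparsity: a location whose coefficients are jointly small is dropped entirely rather than coefficient by coefficient.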
 [30] arXiv:2201.07912 (cross-list from cs.LG) [pdf, other]

Title: Communication-Efficient Device Scheduling for Federated Learning Using Stochastic Optimization
Comments: To be included in Proceedings of INFOCOM 2022, 10 Pages, 5 Figures
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (stat.ML)
Federated learning (FL) is a useful tool in distributed machine learning that utilizes users' local datasets in a privacy-preserving manner. When deploying FL in a constrained wireless environment, however, training models in a time-efficient manner can be a challenging task due to intermittent connectivity of devices, heterogeneous connection quality, and non-i.i.d. data. In this paper, we provide a novel convergence analysis of non-convex loss functions using FL on both i.i.d. and non-i.i.d. datasets with arbitrary device selection probabilities for each round. Then, using the derived convergence bound, we use stochastic optimization to develop a new client selection and power allocation algorithm that minimizes a function of the convergence bound and the average communication time under a transmit power constraint. We find an analytical solution to the minimization problem. One key feature of the algorithm is that knowledge of the channel statistics is not required and only the instantaneous channel state information needs to be known. Using the FEMNIST and CIFAR-10 datasets, we show through simulations that the communication time can be significantly decreased using our algorithm, compared to uniformly random participation.
 [31] arXiv:2201.07915 (cross-list from cs.IT) [pdf, other]

Title: Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Channel
Comments: 24 pages, 37 figures, journal article
Journal-ref: Signal & Image Processing: An International Journal (SIPIJ) Vol. 12, No. 6, December 2021
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP); Computation (stat.CO)
We consider an experimental design problem in which there are two Poisson sources with two possible and known rates, and one counter. Through a switch, the counter can observe the sources individually or the counts can be combined so that the counter observes the sum of the two. The sensor scheduling problem is to determine an optimal proportion of the available time to be allocated toward individual and joint sensing, under a total time constraint. Two different metrics are used for optimization: mutual information between the sources and the observed counts, and probability of detection for the associated source detection problem. Our results, which are primarily computational, indicate similar but not identical results under the two cost functions.
 [32] arXiv:2201.08027 (cross-list from cs.CV) [pdf, ps, other]

Title: A Joint Morphological Profiles and Patch Tensor Change Detection for Hyperspectral Imagery
Subjects: Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
Multitemporal hyperspectral images can be used to detect change information, a task that has gradually attracted researchers' attention. However, traditional change detection algorithms have not deeply explored the relevance of spatial and spectral changed features, which leads to low detection accuracy. To better exploit both spectral and spatial information of changed features, a joint morphology and patch-tensor change detection (JMPT) method is proposed. Initially, a patch-based tensor strategy is adopted to exploit the similarity of spatial structure, where the non-overlapping local patch image is reshaped into a new tensor cube, and then third-order Tucker decomposition and image reconstruction strategies are adopted to obtain more robust multitemporal hyperspectral datasets. Meanwhile, multiple morphological profiles including max-tree and min-tree are applied to extract different attributes of multitemporal images. Finally, these results are fused to generate a final change detection map. Experiments conducted on two real hyperspectral datasets demonstrate that the proposed detector achieves better detection performance.
 [33] arXiv:2201.08105 (cross-list from cs.LG) [pdf, other]

Title: Statistical Depth Functions for Ranking Distributions: Definitions, Statistical Learning and Applications
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The concept of median/consensus has been widely investigated in order to provide a statistical summary of ranking data, i.e. realizations of a random permutation $\Sigma$ of a finite set, $\{1,\; \ldots,\; n\}$ with $n\geq 1$ say. As it sheds light onto only one aspect of $\Sigma$'s distribution $P$, it may neglect other informative features. It is the purpose of this paper to define analogs of quantiles, ranks and statistical procedures based on such quantities for the analysis of ranking data by means of a metric-based notion of depth function on the symmetric group. Overcoming the absence of vector space structure on $\mathfrak{S}_n$, the latter defines a center-outward ordering of the permutations in the support of $P$ and extends the classic metric-based formulation of consensus ranking (medians corresponding then to the deepest permutations). The axiomatic properties that ranking depths should ideally possess are listed, while computational and generalization issues are studied at length. Beyond the theoretical analysis carried out, the relevance of the novel concepts and methods introduced for a wide variety of statistical tasks is also supported by numerous numerical experiments.
 [34] arXiv:2201.08115 (cross-list from cs.AI) [pdf, other]

Title: Priors, Hierarchy, and Information Asymmetry for Skill Transfer in Reinforcement Learning
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
The ability to discover behaviours from past experience and transfer them to new tasks is a hallmark of intelligent agents acting sample-efficiently in the real world. Equipping embodied reinforcement learners with the same ability may be crucial for their successful deployment in robotics. While hierarchical and KL-regularized RL individually hold promise here, arguably a hybrid approach could combine their respective benefits. Key to these fields is the use of information asymmetry to bias which skills are learnt. While asymmetric choice has a large influence on transferability, prior works have explored a narrow range of asymmetries, primarily motivated by intuition. In this paper, we theoretically and empirically show the crucial trade-off, controlled by information asymmetry, between the expressivity and transferability of skills across sequential tasks. Given this insight, we provide a principled approach towards choosing asymmetry and apply our approach to a complex, robotic block stacking domain, unsolvable by baselines, demonstrating the effectiveness of hierarchical KL-regularized RL, coupled with correct asymmetric choice, for sample-efficient transfer learning.
 [35] arXiv:2201.08262 (cross-list from cs.LG) [pdf, other]

Title: Generalizing Off-Policy Evaluation From a Causal Perspective For Sequential Decision-Making
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Assessing the effects of a policy based on observational data from a different policy is a common problem across several high-stakes decision-making domains, and several off-policy evaluation (OPE) techniques have been proposed. However, these methods largely formulate OPE as a problem disassociated from the process used to generate the data (i.e. structural assumptions in the form of a causal graph). We argue that explicitly highlighting this association has important implications for our understanding of the fundamental limits of OPE. First, this implies that the current formulation of OPE corresponds to a narrow set of tasks, i.e. a specific causal estimand which is focused on prospective evaluation of policies over populations or sub-populations. Second, we demonstrate how this association motivates natural desiderata to consider a general set of causal estimands, particularly extending the role of OPE for counterfactual off-policy evaluation at the level of individuals of the population. A precise description of the causal estimand highlights which OPE estimands are identifiable from observational data under the stated generative assumptions. For those OPE estimands that are not identifiable, the causal perspective further highlights where more experimental data is necessary, and highlights situations where human expertise can aid identification and estimation. Furthermore, many formalisms of OPE overlook the role of uncertainty entirely in the estimation process. We demonstrate how specifically characterising the causal estimand highlights the different sources of uncertainty and when human expertise can naturally manage this uncertainty. We discuss each of these aspects as actionable desiderata for future OPE research at scale and in line with practical utility.
 [36] arXiv:2201.08288 (cross-list from cs.DS) [pdf, ps, other]

Title: Scalable $k$-d trees for distributed data
Comments: 34 pages, 3 figures; submitted for publication
Subjects: Data Structures and Algorithms (cs.DS); Computational Engineering, Finance, and Science (cs.CE); Computation (stat.CO)
Data structures known as $k$-d trees have numerous applications in scientific computing, particularly in areas of modern statistics and data science such as range search in decision trees, clustering, nearest neighbors search, local regression, and so forth. In this article we present a scalable mechanism to construct $k$-d trees for distributed data, based on approximating medians for each recursive subdivision of the data. We provide theoretical guarantees of the quality of approximation using this approach, along with a simulation study quantifying the accuracy and scalability of our proposed approach in practice.
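To make the idea of building a $k$-d tree from approximate medians concrete, the single-machine sketch below splits each node at the median of a small random subsample instead of the exact median. Subsampling is only one possible approximation and is an assumption of this example, as are the names `build_kd_tree` and `collect_points`; the paper's distributed mechanism is not reproduced here.

```python
import random
from statistics import median


def build_kd_tree(points, depth=0, leaf_size=8, sample_size=32, rng=None):
    """Build a k-d tree over a list of equal-length tuples.

    The split value on each axis is the median of at most `sample_size`
    randomly chosen points, so it only approximates the true median.
    """
    rng = rng or random.Random(0)
    if len(points) <= leaf_size:
        return {"leaf": True, "points": points}
    axis = depth % len(points[0])  # cycle through coordinates
    if len(points) <= sample_size:
        sample = points
    else:
        sample = rng.sample(points, sample_size)
    split = median(p[axis] for p in sample)
    left = [p for p in points if p[axis] < split]
    right = [p for p in points if p[axis] >= split]
    if not left or not right:
        # Degenerate split (e.g. many ties): stop and keep a leaf.
        return {"leaf": True, "points": points}
    return {
        "leaf": False,
        "axis": axis,
        "split": split,
        "left": build_kd_tree(left, depth + 1, leaf_size, sample_size, rng),
        "right": build_kd_tree(right, depth + 1, leaf_size, sample_size, rng),
    }


def collect_points(node):
    """Return every point stored below `node`."""
    if node["leaf"]:
        return list(node["points"])
    return collect_points(node["left"]) + collect_points(node["right"])
```

The appeal of an approximate split is that each node needs only a small summary of its data rather than a full sort, at the cost of a somewhat less balanced tree.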
Replacements for Fri, 21 Jan 22
 [37] arXiv:1802.02219 (replaced) [pdf, other]

Title: Practical Transfer Learning for Bayesian Optimization
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI)
 [38] arXiv:1902.09602 (replaced) [pdf, other]

Title: Analyzing Data Selection Techniques with Tools from the Theory of Information Losses
Comments: This paper has now been published as a conference proceeding in IEEE Big Data 2021
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [39] arXiv:1903.09668 (replaced) [pdf, ps, other]

Title: Data Augmentation for Bayesian Deep Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
 [40] arXiv:2003.00470 (replaced) [pdf]

Title: Dimensionality reduction to maximize prediction generalization capability
Journal-ref: Nature Machine Intelligence 3, 434-446 (2021)
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
 [41] arXiv:2007.08911 (replaced) [pdf, other]

Title: Technologies for Trustworthy Machine Learning: A Survey in a Socio-Technical Context
Authors: Ehsan Toreini, Mhairi Aitken, Kovila P. L. Coopamootoo, Karen Elliott, Vladimiro Gonzalez Zelaya, Paolo Missier, Magdalene Ng, Aad van Moorsel
Comments: We are updating some sections to include more recent advances
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (stat.ML)
 [42] arXiv:2009.01235 (replaced) [pdf, other]

Title: Quantum Discriminator for Binary Classification
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [43] arXiv:2102.08573 (replaced) [pdf, other]

Title: Robust Mean Estimation in High Dimensions: An Outlier Fraction Agnostic and Efficient Algorithm
Comments: arXiv admin note: text overlap with arXiv:2008.09239
Subjects: Applications (stat.AP); Information Theory (cs.IT)
 [44] arXiv:2103.03370 (replaced) [pdf, other]

Title: Multitask Learning with High-Dimensional Noisy Images
Subjects: Methodology (stat.ME)
 [45] arXiv:2103.10027 (replaced) [pdf, other]

Title: Probabilistic Simplex Component Analysis
Subjects: Signal Processing (eess.SP); Machine Learning (stat.ML)
 [46] arXiv:2104.01165 (replaced) [pdf, other]

Title: Distributional data analysis of accelerometer data from the NHANES database using nonparametric survey regression models
Subjects: Methodology (stat.ME); Applications (stat.AP); Other Statistics (stat.OT)
 [47] arXiv:2106.01282 (replaced) [pdf, other]

Title: Spectral embedding for dynamic networks with stability guarantees
Comments: NeurIPS 2021
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [48] arXiv:2107.06621 (replaced) [pdf, other]

Title: Rough McKean-Vlasov dynamics for robust ensemble Kalman filtering
Comments: 44 pages, 7 figures
Subjects: Probability (math.PR); Numerical Analysis (math.NA); Statistics Theory (math.ST)
 [49] arXiv:2109.01654 (replaced) [pdf, other]

Title: Multi-agent Natural Actor-critic Reinforcement Learning Algorithms
Comments: A very high-level summary of our revision is: In Section 3.5, we theoretically prove that the objective function value from the deterministic variant of MAN algorithms dominates that of the MAAC algorithm under some minimal conditions. It relies on the Lemma 2 of our paper: the minimum singular value of the Fisher information matrix is well within the reciprocal of the policy parameter dimension
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [50] arXiv:2109.01785 (replaced) [pdf, other]

Title: Node Feature Kernels Increase Graph Convolutional Network Robustness
Comments: 16 pages, 5 figures
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
 [51] arXiv:2109.14206 (replaced) [pdf, other]

Title: Exact Statistical Inference for the Wasserstein Distance by Selective Inference
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [52] arXiv:2110.00224 (replaced) [pdf, other]

Title: Censored autoregressive regression models with Student-$t$ innovations
Comments: 22 pages, 10 figures and 3 tables
Subjects: Methodology (stat.ME); Applications (stat.AP)
 [53] arXiv:2110.00533 (replaced) [pdf, ps, other]

Title: Relative Contagiousness of Emerging Virus Variants: An Analysis of the Alpha, Delta, and Omicron SARS-CoV-2 Variants
Authors: Peter Reinhard Hansen
Subjects: Econometrics (econ.EM); Applications (stat.AP)
 [54] arXiv:2110.02128 (replaced) [pdf, other]

Title: NeurWIN: Neural Whittle Index Network For Restless Bandits Via Deep RL
Comments: Accepted for publication in NeurIPS 2021
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [55] arXiv:2110.06623 (replaced) [pdf, other]

Title: SSSNET: Semi-Supervised Signed Network Clustering
Comments: 14 pages
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [56] arXiv:2110.12399 (replaced) [pdf, other]

Title: BINAS: Bilinear Interpretable Neural Architecture Search
Comments: The full code is released at this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [57] arXiv:2111.08118 (replaced) [pdf, other]

Title: Neuro-Hotnet: A Graph Theoretic Approach for Brain FC Estimation
Comments: 36 pages, 10 figures, 3 tables, 2 algorithms
Subjects: Applications (stat.AP); Social and Information Networks (cs.SI); Neurons and Cognition (q-bio.NC)
 [58] arXiv:2111.14000 (replaced) [pdf, other]

Title: Factor-augmented tree ensembles
Authors: Filippo Pellegrino
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
 [59] arXiv:2112.07602 (replaced) [pdf, other]

Title: A Framework for the Meta-Analysis of Randomized Experiments with Applications to Heavy-Tailed Response Data
Authors: Nilesh Tripuraneni, Dhruv Madeka, Dean Foster, Dominique Perrault-Joncas, Michael I. Jordan
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)
 [60] arXiv:2112.07611 (replaced) [pdf, other]

Title: Speeding up Learning Quantum States through Group Equivariant Convolutional Quantum Ansätze
Comments: 16 pages, 12 figures
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Physics (math-ph); Machine Learning (stat.ML)
 [61] arXiv:2201.05773 (replaced) [pdf, other]

Title: Automated causal inference in application to randomized controlled clinical trials
Authors: Jiqing Wu, Nanda Horeweg, Marco de Bruyn, Remi A. Nout, Ina M. Jürgenliemk-Schulz, Ludy C.H.W. Lutgens, Jan J. Jobsen, Elzbieta M. van der Steen-Banasik, Hans W. Nijman, Vincent T.H.B.M. Smit, Tjalling Bosse, Carien L. Creutzberg, Viktor H. Koelzer
Comments: Submitted to Nature Machine Intelligence. The code is publicly available via this https URL
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
 [62] arXiv:2201.06604 (replaced) [pdf, other]

Title: A tool set for random number generation on GPUs in R
Subjects: Computation (stat.CO); Applications (stat.AP)
 [63] arXiv:2201.06616 (replaced) [pdf, other]

Title: Improving the quality control of seismic data through active learning
Comments: 10 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)