Statistics
New submissions
New submissions for Thu, 28 Mar 24
- [1] arXiv:2403.17948 [pdf, ps, other]
Title: The Rule of link functions on Binomial Regression Model: A Cross Sectional Study on Child Malnutrition, Bangladesh
Authors: Md Mehedi Hasan Bhuiyan
Subjects: Methodology (stat.ME)
The link function is a key tool in the binomial regression model, which is defined as a nonlinear model under the GLM approach. It turns the nonlinear regression into a linear model by mapping the linear predictor on the interval (-\infty,\infty) to a probability in [0,1]. Binomial models with four link functions (logit, probit, cloglog and cauchy) are applied to the proportion of child malnutrition among children aged 0-5 years at the household level. The Multiple Indicator Cluster Survey (MICS) 2019, Bangladesh, was conducted jointly by UNICEF and BBS. The survey covered 64000 households using a two-stage stratified sampling technique, of which around 21000 households have children aged 0-5 years. We use bivariate analysis to find the statistical association between the response and sociodemographic features. Among the binary regression models, the probit model provides the best result based on the lowest standard errors of the covariates and the goodness-of-fit measures (deviance, AIC).
- [2] arXiv:2403.17982 [pdf, ps, other]
Title: Markov chain models for inspecting response dynamics in psychological testing
Authors: Andrea Bosco
Comments: 20 pages, 1 figure, 3 tables, 25 equations/matrices. Part of this paper was presented to the XXIX AIP Congress, Experimental Psychology Section. September 18th-20th 2023, Lucca, Italy. Title of the talk: "Differentiating students with signs of ADHD or OCD based on hysteresis in responses to a mind-wandering test. A Study of Markov Chain Test Response Sequences"
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Probability (math.PR)
The importance of considering contextual probabilities in shaping response patterns within psychological testing is underscored, despite the ubiquitous nature of order effects discussed extensively in methodological literature. Drawing from concepts such as path-dependency, first-order autocorrelation, state-dependency, and hysteresis, the present study is an attempt to address how earlier responses serve as an anchor for subsequent answers in tests, surveys, and questionnaires. Introducing the notion of non-commuting observables derived from quantum physics, I highlight their role in characterizing psychological processes and the impact of measurement instruments on participants' responses. I advocate for the utilization of first-order Markov chain modeling to capture and forecast sequential dependencies in survey and test responses. The rationale for employing the first-order Markov chain model lies in individuals' propensity to exhibit partial focus on preceding responses, with recent items most likely exerting a substantial influence on subsequent response selection. This study contributes to advancing our understanding of the dynamics inherent in sequential data within psychological research and provides a methodological framework for conducting longitudinal analyses of response patterns in tests and questionnaires.
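As an illustration of first-order Markov chain modeling of response sequences, the sketch below estimates a transition matrix by counting consecutive response pairs and normalizing each row. The function name, the toy data and the row-normalized count estimator are assumptions made for illustration; the paper's own estimation procedure is not given in the abstract.

```python
from collections import Counter

def transition_matrix(sequences, states):
    """Estimate first-order transition probabilities P(next | previous)
    from a collection of response sequences by row-normalizing pair counts."""
    counts = Counter()
    for seq in sequences:
        # Each consecutive pair (previous answer, next answer) is one transition.
        for prev, nxt in zip(seq, seq[1:]):
            counts[(prev, nxt)] += 1
    matrix = {}
    for s in states:
        row_total = sum(counts[(s, t)] for t in states)
        # A state that is never left gets an all-zero row.
        matrix[s] = {
            t: (counts[(s, t)] / row_total if row_total else 0.0) for t in states
        }
    return matrix

if __name__ == "__main__":
    # Hypothetical responses of two participants on a two-point scale.
    responses = [[1, 1, 2, 1], [2, 2, 1]]
    print(transition_matrix(responses, states=[1, 2]))
```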
- [3] arXiv:2403.17986 [pdf, other]
Title: Comment on "Safe Testing" by Grünwald, de Heide, and Koolen
Authors: Joris Mulder
Comments: 2 pages, 1 figure; comment on arXiv:1906.07801
Subjects: Methodology (stat.ME)
This comment briefly reflects on "Safe Testing" by Grünwald et al. (2024). The safety of fractional Bayes factors (O'Hagan, 1995) is illustrated and compared to (safe) Bayes factors based on the right Haar prior.
- [4] arXiv:2403.18039 [pdf, other]
Title: Doubly robust causal inference through penalized bias-reduced estimation: combining non-probability samples with designed surveys
Subjects: Methodology (stat.ME)
Causal inference on the average treatment effect (ATE) using non-probability samples, such as electronic health records (EHR), faces challenges from sample selection bias and high-dimensional covariates. This requires considering a selection model alongside treatment and outcome models that are typical ingredients in causal inference. This paper considers integrating large non-probability samples with external probability samples from a design survey, addressing moderately high-dimensional confounders and variables that influence selection. In contrast to the two-step approach that separates variable selection and debiased estimation, we propose a one-step plug-in doubly robust (DR) estimator of the ATE. We construct a novel penalized estimating equation by minimizing the squared asymptotic bias of the DR estimator. Our approach facilitates ATE inference in high-dimensional settings by ignoring the variability in estimating nuisance parameters, which is not guaranteed in conventional likelihood approaches with non-differentiable L1-type penalties. We provide a consistent variance estimator for the DR estimator. Simulation studies demonstrate the double robustness of our estimator under misspecification of either the outcome model or the selection and treatment models, as well as the validity of statistical inference under penalized estimation. We apply our method to integrate EHR data from the Michigan Genomics Initiative with an external probability sample.
- [5] arXiv:2403.18054 [pdf, other]
Title: Modifying Gibbs sampling to avoid self transitions
Authors: Radford M. Neal
Subjects: Computation (stat.CO); Computational Physics (physics.comp-ph)
Gibbs sampling repeatedly samples from the conditional distribution of one variable, x_i, given other variables, either choosing i randomly, or updating sequentially using some systematic or random order. When x_i is discrete, a Gibbs sampling update may choose a new value that is the same as the old value. A theorem of Peskun indicates that, when i is chosen randomly, a reversible method that reduces the probability of such self transitions, while increasing the probabilities of transitioning to each of the other values, will decrease the asymptotic variance of estimates. This has inspired two modified Gibbs sampling methods, originally due to Frigessi, et al and to Liu, though these do not always reduce self transitions to the minimum possible. Methods that do reduce the probability of self transitions to the minimum, but do not satisfy the conditions of Peskun's theorem, have also been devised, by Suwa and Todo. I review past methods, and introduce a broader class of reversible methods, based on what I call "antithetic modification", which also reduce asymptotic variance compared to Gibbs sampling, even when not satisfying the conditions of Peskun's theorem. A modification of one method in this class reduces self transitions to the minimum possible, while still always reducing asymptotic variance compared to Gibbs sampling. I introduce another new class of non-reversible methods based on slice sampling that can also minimize self transition probabilities. I provide explicit, efficient implementations of all these methods, and compare their performance in simulations of a 2D Potts model, a Bayesian mixture model, and a belief network with unobserved variables. The non-reversibility produced by sequential updating can be beneficial, but no consistent benefit is seen from the individual updates being done by a non-reversible method.
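A small sketch may help show how a modified update reduces self transitions. It builds the transition matrix of the Metropolized Gibbs update commonly attributed to Liu: from current value k, propose j != k with probability pi_j / (1 - pi_k) and accept with probability min(1, (1 - pi_k) / (1 - pi_j)). This form is my reading of that method, not something stated in the abstract, and the function names are illustrative; it is not the paper's new "antithetic modification" class.

```python
def gibbs_matrix(pi):
    """Plain Gibbs update: the new value is drawn from pi regardless of the old one."""
    return [list(pi) for _ in pi]

def metropolized_gibbs_matrix(pi):
    """Transition matrix of the Metropolized Gibbs update (assumed form).
    Proposal times acceptance simplifies to pi[j] / (1 - min(pi[k], pi[j]))."""
    n = len(pi)
    matrix = []
    for k in range(n):
        row = [0.0] * n
        for j in range(n):
            if j != k:
                row[j] = pi[j] / (1.0 - min(pi[k], pi[j]))
        # Whatever probability is left over is the self transition.
        row[k] = max(0.0, 1.0 - sum(row))
        matrix.append(row)
    return matrix

if __name__ == "__main__":
    pi = [0.6, 0.3, 0.1]  # hypothetical conditional distribution of x_i
    for k, row in enumerate(metropolized_gibbs_matrix(pi)):
        print(f"self transition from {k}: {row[k]:.4f} (Gibbs: {pi[k]:.4f})")
```

Every off-diagonal entry is at least as large as the corresponding Gibbs entry, which is the condition under which Peskun's theorem applies. With these probabilities the self transition from the most probable value stays above zero, consistent with the abstract's remark that such methods do not always reach the minimum.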
- [6] arXiv:2403.18069 [pdf, other]
Title: Personalized Imputation in metric spaces via conformal prediction: Applications in Predicting Diabetes Development with Continuous Glucose Monitoring Information
Subjects: Methodology (stat.ME); Applications (stat.AP)
The challenge of handling missing data is widespread in modern data analysis, particularly during the preprocessing phase and in various inferential modeling tasks. Although numerous algorithms exist for imputing missing data, the assessment of imputation quality at the patient level often lacks personalized statistical approaches. Moreover, there is a scarcity of imputation methods for metric-space-based statistical objects. The aim of this paper is to introduce a novel two-step framework that comprises: (i) an imputation method for statistical objects taking values in metric spaces, and (ii) a criterion for personalizing imputation using conformal inference techniques. This work is motivated by the need to impute distributional functional representations of continuous glucose monitoring (CGM) data within the context of a longitudinal study on diabetes, where a significant fraction of patients do not have available CGM profiles. The importance of these methods is illustrated by evaluating the effectiveness of CGM data as new digital biomarkers to predict the time to diabetes onset in healthy populations. To address these scientific challenges, we propose: (i) a new regression algorithm for missing responses; (ii) novel conformal prediction algorithms tailored for metric spaces with a focus on density responses within the 2-Wasserstein geometry; (iii) a broadly applicable personalized imputation method criterion, designed to enhance both of the aforementioned strategies, yet valid across any statistical model and data structure. Our findings reveal that incorporating CGM data into diabetes time-to-event analysis, augmented with a novel personalization phase of imputation, significantly enhances predictive accuracy by over ten percent compared to traditional predictive models for time to diabetes.
- [7] arXiv:2403.18072 [pdf, other]
Title: Goal-Oriented Bayesian Optimal Experimental Design for Nonlinear Models using Markov Chain Monte Carlo
Subjects: Computation (stat.CO); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Optimal experimental design (OED) provides a systematic approach to quantify and maximize the value of experimental data. Under a Bayesian approach, conventional OED maximizes the expected information gain (EIG) on model parameters. However, we are often interested in not the parameters themselves, but predictive quantities of interest (QoIs) that depend on the parameters in a nonlinear manner. We present a computational framework of predictive goal-oriented OED (GO-OED) suitable for nonlinear observation and prediction models, which seeks the experimental design providing the greatest EIG on the QoIs. In particular, we propose a nested Monte Carlo estimator for the QoI EIG, featuring Markov chain Monte Carlo for posterior sampling and kernel density estimation for evaluating the posterior-predictive density and its Kullback-Leibler divergence from the prior-predictive. The GO-OED design is then found by maximizing the EIG over the design space using Bayesian optimization. We demonstrate the effectiveness of the overall nonlinear GO-OED method, and illustrate its differences versus conventional non-GO-OED, through various test problems and an application of sensor placement for source inversion in a convection-diffusion field.
- [8] arXiv:2403.18115 [pdf, other]
Title: Assessing COVID-19 Vaccine Effectiveness in Observational Studies via Nested Trial Emulation
Comments: 27 pages, 2 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
Observational data are often used to estimate real-world effectiveness and durability of coronavirus disease 2019 (COVID-19) vaccines. A sequence of nested trials can be emulated to draw inference from such data while minimizing selection bias, immortal time bias, and confounding. Typically, when nested trial emulation (NTE) is employed, effect estimates are pooled across trials to increase statistical efficiency. However, such pooled estimates may lack a clear interpretation when the treatment effect is heterogeneous across trials. In the context of COVID-19, vaccine effectiveness quite plausibly will vary over calendar time due to newly emerging variants of the virus. This manuscript considers a NTE inverse probability weighted estimator of vaccine effectiveness that may vary over calendar time, time since vaccination, or both. Statistical testing of the trial effect homogeneity assumption is considered. Simulation studies are presented examining the finite-sample performance of these methods under a variety of scenarios. The methods are used to estimate vaccine effectiveness against COVID-19 outcomes using observational data on over 120,000 residents of Abruzzo, Italy during 2021.
- [9] arXiv:2403.18216 [pdf, other]
Title: Minimax Optimal Fair Classification with Bounded Demographic Disparity
Subjects: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG); Statistics Theory (math.ST)
Mitigating the disparate impact of statistical machine learning methods is crucial for ensuring fairness. While extensive research aims to reduce disparity, the effect of using a \emph{finite dataset} -- as opposed to the entire population -- remains unclear. This paper explores the statistical foundations of fair binary classification with two protected groups, focusing on controlling demographic disparity, defined as the difference in acceptance rates between the groups. Although fairness may come at the cost of accuracy even with infinite data, we show that using a finite sample incurs additional costs due to the need to estimate group-specific acceptance thresholds. We study the minimax optimal classification error while constraining demographic disparity to a user-specified threshold. To quantify the impact of fairness constraints, we introduce a novel measure called \emph{fairness-aware excess risk} and derive a minimax lower bound on this measure that all classifiers must satisfy. Furthermore, we propose FairBayes-DDP+, a group-wise thresholding method with an offset that we show attains the minimax lower bound. Our lower bound proofs involve several innovations. Experiments support that FairBayes-DDP+ controls disparity at the user-specified level, while being faster and having a more favorable fairness-accuracy tradeoff than several baselines.
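The definition of demographic disparity as the difference in acceptance rates between two groups can be sketched with group-wise thresholds as follows. The function names, scores and thresholds are hypothetical, and taking the absolute difference is an assumption; this is not the FairBayes-DDP+ procedure itself, only the quantity it is described as controlling.

```python
def acceptance_rate(scores, threshold):
    """Fraction of individuals whose score meets or exceeds the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def demographic_disparity(scores_a, scores_b, threshold_a, threshold_b):
    """Absolute difference in acceptance rates between two protected groups,
    allowing each group its own threshold (group-wise thresholding)."""
    rate_a = acceptance_rate(scores_a, threshold_a)
    rate_b = acceptance_rate(scores_b, threshold_b)
    return abs(rate_a - rate_b)

if __name__ == "__main__":
    # Hypothetical classifier scores for two groups.
    group_a = [0.2, 0.6, 0.8, 0.9]
    group_b = [0.1, 0.3, 0.7, 0.4]
    # One shared threshold versus a lower threshold for group B.
    print(demographic_disparity(group_a, group_b, 0.5, 0.5))
    print(demographic_disparity(group_a, group_b, 0.5, 0.35))
```

Lowering the threshold for the group with the lower acceptance rate shrinks the disparity. With a finite sample, such thresholds have to be estimated, which is the source of the extra cost the abstract describes.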
- [10] arXiv:2403.18245 [pdf, other]
Title: LocalCop: An R package for local likelihood inference for conditional copulas
Comments: 6 pages, 2 figures; submitted to the Journal of Open Source Software (JOSS)
Subjects: Computation (stat.CO); Methodology (stat.ME)
Conditional copula models allow the dependence structure between multiple response variables to be modelled as a function of covariates. LocalCop (Acar & Lysy, 2024) is an R/C++ package for computationally efficient semiparametric conditional copula modelling using a local likelihood inference framework developed in Acar, Craiu, & Yao (2011), Acar, Craiu, & Yao (2013) and Acar, Czado, & Lysy (2019).
- [11] arXiv:2403.18255 [pdf, other]
Title: Statistical inference for multi-regime threshold Ornstein-Uhlenbeck processes
Subjects: Statistics Theory (math.ST)
In this paper, we investigate parameter estimation for threshold Ornstein-Uhlenbeck processes. The least squares method is used to obtain continuous-type and discrete-type estimators for the drift parameters based on continuous and discrete observations, respectively. The strong consistency and asymptotic normality of the proposed least squares estimators are studied. We also propose a modified quadratic variation estimator based on long-time observations for the diffusion parameters and prove its consistency. Our simulation results suggest that the performance of our proposed estimators for the drift parameters may show improvements compared to generalized moment estimators. Additionally, the proposed modified quadratic variation estimator exhibits potential advantages over the usual quadratic variation estimator with relatively small sample sizes. In particular, our method can be applied to multi-regime cases ($m>2$), while the generalized moment method only deals with the two-regime case ($m=2$). U.S. treasury rate data are used to illustrate the theoretical results.
- [12] arXiv:2403.18269 [pdf, other]
Title: Clustering Change Sign Detection by Fusing Mixture Complexity
Comments: 23 pages
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
This paper proposes an early detection method for cluster structural changes. Cluster structure refers to discrete structural characteristics, such as the number of clusters, when data are represented using finite mixture models, such as Gaussian mixture models. We focused on scenarios in which the cluster structure gradually changed over time. For finite mixture models, the concept of mixture complexity (MC) measures the continuous cluster size by considering the cluster proportion bias and overlap between clusters. In this paper, we propose MC fusion as an extension of MC to handle situations in which multiple mixture numbers are possible in a finite mixture model. By incorporating the fusion of multiple models, our approach accurately captured the cluster structure during transitional periods of gradual change. Moreover, we introduce a method for detecting changes in the cluster structure by examining the transition of MC fusion. We demonstrate the effectiveness of our method through empirical analysis using both artificial and real-world datasets.
- [13] arXiv:2403.18353 [pdf, other]
Title: Early Stopping for Ensemble Kalman-Bucy Inversion
Authors: Maia Tienstra
Subjects: Statistics Theory (math.ST)
Bayesian linear inverse problems aim to recover an unknown signal from noisy observations, incorporating prior knowledge. This paper analyses a data dependent method to choose the scale parameter of a Gaussian prior. The method we study arises from early stopping methods, which have been successfully applied to a range of problems for statistical inverse problems in the frequentist setting. These results are extended to the Bayesian setting. We study the use of a discrepancy based stopping rule in the setting of random noise. Our proposed stopping rule results in optimal rates under certain conditions on the prior covariance operator. We furthermore derive for which class of signals this method is adaptive. It is also shown that the associated posterior contracts at the optimal rate and provides a conservative measure of uncertainty. We implement the proposed stopping rule using the continuous-time ensemble Kalman--Bucy filter (EnKBF). The fictitious time parameter replaces the scale parameter, and the ensemble size is appropriately adjusted in order to not lose statistical optimality of the computed estimator. The EnKBF, then, gives a continuous process from the prior distribution to the posterior which is terminated using the proposed stopping rule.
- [14] arXiv:2403.18355 [pdf, other]
Title: Supervised Multiple Kernel Learning approaches for multi-omics data integration
Authors: Mitja Briscik (IMT), Gabriele Tazza, Marie-Agnes Dillies, László Vidács, Sébastien Dejean (IMT)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Advances in high-throughput technologies have led to an ever-increasing availability of omics datasets. The integration of multiple heterogeneous data sources is currently an issue for biology and bioinformatics. Multiple kernel learning (MKL) has been shown to be a flexible and valid approach to consider the diverse nature of multi-omics inputs, despite being an underused tool in genomic data mining. We provide novel MKL approaches based on different kernel fusion strategies. To learn from the meta-kernel of input kernels, we adapted unsupervised integration algorithms for supervised tasks with support vector machines. We also tested deep learning architectures for kernel fusion and classification. The results show that MKL-based models can compete with more complex, state-of-the-art, supervised multi-omics integrative approaches. Multiple kernel learning offers a natural framework for predictive models in multi-omics genomic data. Our results offer a direction for bio-data mining research and further development of methods for heterogeneous data integration.
- [15] arXiv:2403.18357 [pdf, ps, other]
Title: Minimax density estimation in the adversarial framework under local differential privacy
Authors: Mélisande Albert (IMT, INSA Toulouse), Juliette Chevallier (IMT, INSA Toulouse), Béatrice Laurent (INSA Toulouse, IMT), Ousmane Sacko (UPN, MODAL'X)
Subjects: Statistics Theory (math.ST)
We consider the problem of nonparametric density estimation under privacy constraints in an adversarial framework. To this end, we study minimax rates under local differential privacy over Sobolev spaces. We first obtain a lower bound which allows us to quantify the impact of privacy compared with the classical framework. Next, we introduce a new Coordinate block privacy mechanism that guarantees local differential privacy, which, coupled with a projection estimator, achieves the minimax optimal rates.
- [16] arXiv:2403.18432 [pdf, other]
Title: Poisson Regression in one Covariate on Massive Data
Comments: 16 pages, 11 figures
Subjects: Statistics Theory (math.ST)
The goal of subsampling is to select an informative subset of all observations when using the full data for statistical analysis is not viable. We construct locally $D$-optimal subsampling designs under a Poisson regression model with a log link in one covariate. A representation of the support of locally $D$-optimal subsampling designs is established. We make statements on scale-location transformations of the covariate that require a simultaneous transformation of the regression parameter. The performance of the methods is demonstrated by illustrative examples. To show the advantage of the optimal subsampling designs, we examine the efficiency of uniform random subsampling as well as of two heuristic designs. Further, the efficiency of locally $D$-optimal subsampling designs is studied when the parameter is misspecified.
- [17] arXiv:2403.18464 [pdf, other]
Title: Cumulative Incidence Function Estimation Based on Population-Based Biobank Data
Subjects: Methodology (stat.ME)
Many countries have established population-based biobanks, which are being used increasingly in epidemiological and clinical research. These biobanks offer opportunities for large-scale studies addressing questions beyond the scope of traditional clinical trials or cohort studies. However, using biobank data poses new challenges. Typically, biobank data is collected from a study cohort recruited over a defined calendar period, with subjects entering the study at various ages falling between $c_L$ and $c_U$. This work focuses on biobank data with individuals reporting disease-onset age upon recruitment, termed prevalent data, along with individuals initially recruited as healthy, and their disease onset observed during the follow-up period. We propose a novel cumulative incidence function (CIF) estimator that efficiently incorporates prevalent cases, in contrast to existing methods, providing two advantages: (1) increased efficiency, and (2) CIF estimation for ages before the lower limit, $c_L$.
- [18] arXiv:2403.18540 [pdf, other]
Title: skscope: Fast Sparsity-Constrained Optimization in Python
Authors: Zezhi Wang, Jin Zhu, Peng Chen, Huiyang Peng, Xiaoke Zhang, Anran Wang, Yu Zheng, Junxian Zhu, Xueqin Wang
Comments: 4 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Applying iterative solvers on sparsity-constrained optimization (SCO) requires tedious mathematical deduction and careful programming/debugging that hinders these solvers' broad impact. In the paper, the library skscope is introduced to overcome such an obstacle. With skscope, users can solve the SCO by just programming the objective function. The convenience of skscope is demonstrated through two examples in the paper, where sparse linear regression and trend filtering are addressed with just four lines of code. More importantly, skscope's efficient implementation allows state-of-the-art solvers to quickly attain the sparse solution regardless of the high dimensionality of parameter space. Numerical experiments reveal the available solvers in skscope can achieve up to 80x speedup on the competing relaxation solutions obtained via the benchmarked convex solver. skscope is published on the Python Package Index (PyPI) and Conda, and its source code is available at: https://github.com/abess-team/skscope.
- [19] arXiv:2403.18549 [pdf, other]
Title: A communication-efficient, online changepoint detection method for monitoring distributed sensor networks
Comments: 36 pages, 8 figures, 5 tables, accepted by Statistics and Computing
Subjects: Methodology (stat.ME)
We consider the challenge of efficiently detecting changes within a network of sensors, where we also need to minimise communication between sensors and the cloud. We propose an online, communication-efficient method to detect such changes. The procedure works by performing likelihood ratio tests at each time point, and two thresholds are chosen to filter unimportant test statistics and make decisions based on the aggregated test statistics respectively. We provide asymptotic theory concerning consistency and the asymptotic distribution if there are no changes. Simulation results suggest that our method can achieve similar performance to the idealised setting, where we have no constraints on communication between sensors, but substantially reduce the transmission costs.
- [20] arXiv:2403.18578 [pdf, other]
Title: SteinGen: Generating Fidelitous and Diverse Graph Samples
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Generating graphs that preserve characteristic structures while promoting sample diversity can be challenging, especially when the number of graph observations is small. Here, we tackle the problem of graph generation from only one observed graph. The classical approach of graph generation from parametric models relies on the estimation of parameters, which can be inconsistent or expensive to compute due to intractable normalisation constants. Generative modelling based on machine learning techniques to generate high-quality graph samples avoids parameter estimation but usually requires abundant training samples. Our proposed generating procedure, SteinGen, which is phrased in the setting of graphs as realisations of exponential random graph models, combines ideas from Stein's method and MCMC by employing Markovian dynamics which are based on a Stein operator for the target model. SteinGen uses the Glauber dynamics associated with an estimated Stein operator to generate a sample, and re-estimates the Stein operator from the sample after every sampling step. We show that on a class of exponential random graph models this novel "estimation and re-estimation" generation strategy yields high distributional similarity (high fidelity) to the original data, combined with high sample diversity.
- [21] arXiv:2403.18602 [pdf, other]
Title: Collaborative graphical lasso
Subjects: Methodology (stat.ME); Molecular Networks (q-bio.MN)
In recent years, the availability of multi-omics data has increased substantially. Multi-omics data integration methods mainly aim to leverage different molecular data sets to gain a complete molecular description of biological processes. An attractive integration approach is the reconstruction of multi-omics networks. However, the development of effective multi-omics network reconstruction strategies lags behind. This hinders maximizing the potential of multi-omics data sets. With this study, we advance the frontier of multi-omics network reconstruction by introducing "collaborative graphical lasso" as a novel strategy. Our proposed algorithm synergizes "graphical lasso" with the concept of "collaboration", effectively harmonizing multi-omics data sets integration, thereby enhancing the accuracy of network inference. Besides, to tackle model selection in this framework, we designed an ad hoc procedure based on network stability. We assess the performance of collaborative graphical lasso and the corresponding model selection procedure through simulations, and we apply them to publicly available multi-omics data. This demonstrated collaborative graphical lasso is able to reconstruct known biological connections and suggest previously unknown and biologically coherent interactions, enabling the generation of novel hypotheses. We implemented collaborative graphical lasso as an R package, available on CRAN as coglasso.
- [22] arXiv:2403.18658 [pdf, ps, other]
Title: Theoretical Guarantees for the Subspace-Constrained Tyler's Estimator
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
This work analyzes the subspace-constrained Tyler's estimator (STE) designed for recovering a low-dimensional subspace within a dataset that may be highly corrupted with outliers. It assumes a weak inlier-outlier model and allows the fraction of inliers to be smaller than a fraction that leads to computational hardness of the robust subspace recovery problem. It shows that in this setting, if the initialization of STE, which is an iterative algorithm, satisfies a certain condition, then STE can effectively recover the underlying subspace. It further shows that under the generalized haystack model, STE initialized by Tyler's M-estimator (TME) can recover the subspace when the fraction of inliers is too small for TME to handle.
- [23] arXiv:2403.18664 [pdf, other]
Title: Neural Network-Based Piecewise Survival Models
Comments: 7 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
In this paper, a family of neural network-based survival models is presented. The models are specified based on piecewise definitions of the hazard function and the density function on a partitioning of the time; both constant and linear piecewise definitions are presented, resulting in a family of four models. The models can be seen as an extension of the commonly used discrete-time and piecewise exponential models and thereby add flexibility to this set of standard models. Using a simulated dataset the models are shown to perform well compared to the highly expressive, state-of-the-art energy-based model, while only requiring a fraction of the computation time.
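For readers unfamiliar with piecewise definitions of the hazard, the sketch below evaluates the survival function S(t) = exp(-H(t)) when the hazard is constant on each interval of a time partition. The function names and numbers are illustrative assumptions; in the paper the per-interval values would come from a neural network, which is not modeled here.

```python
import math

def cumulative_hazard(t, cuts, hazards):
    """Integrate a piecewise-constant hazard from 0 to t.
    `cuts` are the interior interval boundaries (increasing, positive);
    `hazards` holds one rate per interval, so len(hazards) == len(cuts) + 1."""
    if len(hazards) != len(cuts) + 1:
        raise ValueError("need exactly one hazard value per interval")
    total = 0.0
    start = 0.0
    # The last interval is open-ended, so its right boundary is infinity.
    for end, rate in zip(list(cuts) + [math.inf], hazards):
        if t <= start:
            break
        total += rate * (min(t, end) - start)
        start = end
    return total

def survival(t, cuts, hazards):
    """Survival probability S(t) = exp(-H(t))."""
    return math.exp(-cumulative_hazard(t, cuts, hazards))

if __name__ == "__main__":
    cuts = [1.0, 2.0]          # intervals [0,1), [1,2), [2,inf)
    hazards = [0.5, 1.0, 2.0]  # hypothetical constant hazard in each interval
    print(survival(1.5, cuts, hazards))
```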
- [24] arXiv:2403.18782 [pdf, ps, other]
Title: Beyond boundaries: Gary Lorden's groundbreaking contributions to sequential analysis
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
Gary Lorden provided a number of fundamental and novel insights to sequential hypothesis testing and changepoint detection. In this article we provide an overview of Lorden's contributions in the context of existing results in those areas, and some extensions made possible by Lorden's work, mentioning also areas of application including threat detection in physical-computer systems, near-Earth space informatics, epidemiology, clinical trials, and finance.
Cross-lists for Thu, 28 Mar 24
- [25] arXiv:2403.17978 (cross-list from cs.CR) [pdf, other]
Title: Holographic Global Convolutional Networks for Long-Range Prediction Tasks in Malware Detection
Comments: To appear in Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024, Valencia, Spain
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Malware detection is an interesting and valuable domain to work in because it has significant real-world impact and unique machine-learning challenges. We investigate existing long-range techniques and benchmarks and find that they're not very suitable in this problem area. In this paper, we introduce Holographic Global Convolutional Networks (HGConv) that utilize the properties of Holographic Reduced Representations (HRR) to encode and decode features from sequence elements. Unlike other global convolutional methods, our method does not require any intricate kernel computation or crafted kernel design. HGConv kernels are defined as simple parameters learned through backpropagation. The proposed method has achieved new SOTA results on Microsoft Malware Classification Challenge, Drebin, and EMBER malware benchmarks. With log-linear complexity in sequence length, the empirical results demonstrate substantially faster run-time by HGConv compared to other methods achieving far more efficient scaling even with sequence length $\geq 100,000$.
- [26] arXiv:2403.18124 (cross-list from math.OC) [pdf, other]
-
Title: Stochastic Finite Volume Method for Uncertainty Management in Gas Pipeline Network Flows
Subjects: Optimization and Control (math.OC); Computation (stat.CO)
Natural gas consumption by users of pipeline networks is subject to increasing uncertainty that originates from the intermittent nature of electric power loads serviced by gas-fired generators. To enable computationally efficient optimization of gas network flows subject to uncertainty, we develop a finite volume representation of stochastic solutions of hyperbolic partial differential equation (PDE) systems on graph-connected domains with nodal coupling and boundary conditions. The representation is used to express the physical constraints in stochastic optimization problems for gas flow allocation subject to uncertain parameters. The method is based on the stochastic finite volume approach that was recently developed for uncertainty quantification in transient flows represented by hyperbolic PDEs on graphs. In this study, we develop optimization formulations for steady-state gas flow over actuated transport networks subject to probabilistic constraints. In addition to the distributions for the physical solutions, we examine the dual variables that are produced by way of the optimization, and interpret them as price distributions that quantify the financial volatility that arises through demand uncertainty modeled in an optimization-driven gas market mechanism. We demonstrate the computation and distributional analysis using a single-pipe example and a small test network.
- [27] arXiv:2403.18127 (cross-list from cs.LG) [pdf, ps, other]
-
Title: A Correction of Pseudo Log-Likelihood Method
Comments: 7 pages
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Pseudo log-likelihood is a type of maximum likelihood estimation (MLE) method used in various fields including contextual bandits, influence maximization of social networks, and causal bandits. However, in previous literature \citep{li2017provably, zhang2022online, xiong2022combinatorial, feng2023combinatorial1, feng2023combinatorial2}, the log-likelihood function may not be bounded, which may result in the proposed algorithms not being well-defined. In this paper, we give a counterexample in which maximum pseudo log-likelihood estimation fails and then provide a solution to correct the algorithms in \citep{li2017provably, zhang2022online, xiong2022combinatorial, feng2023combinatorial1, feng2023combinatorial2}.
- [28] arXiv:2403.18219 (cross-list from cs.LG) [pdf, ps, other]
-
Title: From Two-Dimensional to Three-Dimensional Environment with Q-Learning: Modeling Autonomous Navigation with Reinforcement Learning and no Libraries
Authors: Ergon Cugler de Moraes Silva
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO)
Reinforcement learning (RL) algorithms have become indispensable tools in artificial intelligence, empowering agents to acquire optimal decision-making policies through interactions with their environment and feedback mechanisms. This study explores the performance of RL agents in both two-dimensional (2D) and three-dimensional (3D) environments, aiming to research the dynamics of learning across different spatial dimensions. A key aspect of this investigation is the absence of pre-made libraries for learning, with the algorithm developed exclusively through computational mathematics. The methodological framework centers on RL principles, employing a Q-learning agent class and distinct environment classes tailored to each spatial dimension. The research aims to address the question: How do reinforcement learning agents adapt and perform in environments of varying spatial dimensions, particularly in 2D and 3D settings? Through empirical analysis, the study evaluates agents' learning trajectories and adaptation processes, revealing insights into the efficacy of RL algorithms in navigating complex, multi-dimensional spaces. Reflections on the findings prompt considerations for future research, particularly in understanding the dynamics of learning in higher-dimensional environments.
- [29] arXiv:2403.18248 (cross-list from econ.EM) [pdf, other]
-
Title: Statistical Inference of Optimal Allocations I: Regularities and their Implications
Subjects: Econometrics (econ.EM); Machine Learning (stat.ML)
In this paper, we develop a functional differentiability approach for solving statistical optimal allocation problems. We first derive Hadamard differentiability of the value function through a detailed analysis of the general properties of the sorting operator. Central to our framework are the concept of Hausdorff measure and the area and coarea integration formulas from geometric measure theory. Building on our Hadamard differentiability results, we demonstrate how the functional delta method can be used to directly derive the asymptotic properties of the value function process for binary constrained optimal allocation problems, as well as the two-step ROC curve estimator. Moreover, leveraging profound insights from geometric functional analysis on convex and local Lipschitz functionals, we obtain additional generic Fréchet differentiability results for the value functions of optimal allocation problems. These compelling findings motivate us to study carefully the first order approximation of the optimal social welfare. In this paper, we then present a double / debiased estimator for the value functions. Importantly, the conditions outlined in the Hadamard differentiability section validate the margin assumption from the statistical classification literature employing plug-in methods that justifies a faster convergence rate.
- [30] arXiv:2403.18301 (cross-list from cs.LG) [pdf, other]
-
Title: Selective Mixup Fine-Tuning for Optimizing Non-Decomposable Objectives
Authors: Shrinivas Ramasubramanian, Harsh Rangwani, Sho Takemori, Kunal Samanta, Yuhei Umeda, Venkatesh Babu Radhakrishnan
Comments: ICLR 2024 SpotLight
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
The rise in internet usage has led to the generation of massive amounts of data, resulting in the adoption of various supervised and semi-supervised machine learning algorithms, which can effectively utilize the colossal amount of data to train models. However, before deploying these models in the real world, these must be strictly evaluated on performance measures like worst-case recall and satisfy constraints such as fairness. We find that current state-of-the-art empirical techniques offer sub-optimal performance on these practical, non-decomposable performance objectives. On the other hand, the theoretical techniques necessitate training a new model from scratch for each performance objective. To bridge the gap, we propose SelMix, a selective mixup-based inexpensive fine-tuning technique for pre-trained models, to optimize for the desired objective. The core idea of our framework is to determine a sampling distribution to perform a mixup of features between samples from particular classes such that it optimizes the given objective. We comprehensively evaluate our technique against the existing empirical and theoretically principled methods on standard benchmark datasets for imbalanced classification. We find that proposed SelMix fine-tuning significantly improves the performance for various practical non-decomposable objectives across benchmarks.
- [31] arXiv:2403.18430 (cross-list from cs.CL) [pdf, other]
-
Title: Exploring language relations through syntactic distances and geographic proximity
Comments: 36 pages
Subjects: Computation and Language (cs.CL); Data Analysis, Statistics and Probability (physics.data-an); Physics and Society (physics.soc-ph); Applications (stat.AP)
Languages are grouped into families that share common linguistic traits. While this approach has been successful in understanding genetic relations between diverse languages, more analyses are needed to accurately quantify their relatedness, especially in less studied linguistic levels such as syntax. Here, we explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset. Within an information-theoretic framework, we show that employing POS trigrams maximizes the possibility of capturing syntactic variations while being at the same time compatible with the amount of available data. Linguistic connections are then established by assessing pairwise distances based on the POS distributions. Intriguingly, our analysis reveals definite clusters that correspond to well known language families and groups, with exceptions explained by distinct morphological typologies. Furthermore, we obtain a significant correlation between language similarity and geographic distance, which underscores the influence of spatial proximity on language kinships.
- [32] arXiv:2403.18668 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Aiming for Relevance
Comments: 10 pages, 9 figures, AMIA Informatics 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
Vital signs are crucial in intensive care units (ICUs). They are used to track the patient's state and to identify clinically significant changes. Predicting vital sign trajectories is valuable for early detection of adverse events. However, conventional machine learning metrics like RMSE often fail to capture the true clinical relevance of such predictions. We introduce novel vital sign prediction performance metrics that align with clinical contexts, focusing on deviations from clinical norms, overall trends, and trend deviations. These metrics are derived from empirical utility curves obtained in a previous study through interviews with ICU clinicians. We validate the metrics' usefulness using simulated and real clinical datasets (MIMIC and eICU). Furthermore, we employ these metrics as loss functions for neural networks, resulting in models that excel in predicting clinically significant events. This research paves the way for clinically relevant machine learning model evaluation and optimization, promising to improve ICU patient care.
- [33] arXiv:2403.18685 (cross-list from cs.IT) [pdf, other]
-
Title: Representatividad Muestral en la Incertidumbre Simétrica Multivariada para la Selección de Atributos
Authors: Gustavo Sosa-Cabrera
Comments: 52 pages, in Spanish. Advisors: Miguel García-Torres, Santiago Gómez-Guerrero, Christian E. Schaerer Serra
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
In this work, we analyze the behavior of the multivariate symmetric uncertainty (MSU) measure through the use of statistical simulation techniques under various mixes of informative and non-informative randomly generated features. Experiments show how the number of attributes, their cardinalities, and the sample size affect the MSU. In this thesis, through observation of results, a heuristic condition is proposed that preserves good quality in the MSU under different combinations of these three factors, providing a new useful criterion to help drive the process of dimension reduction.
--
En el presente trabajo hemos analizado el comportamiento de una versión multivariada de la incertidumbre simétrica a través de técnicas de simulación estadísticas sobre varias combinaciones de atributos informativos y no-informativos generados de forma aleatoria. Los experimentos muestran cómo el número de atributos, sus cardinalidades y el tamaño muestral afectan al MSU como medida. En esta tesis, mediante la observación de resultados hemos propuesto una condición que preserva una buena calidad en el MSU bajo diferentes combinaciones de los tres factores mencionados, lo cual provee un nuevo y valioso criterio para llevar a cabo el proceso de reducción de dimensionalidad.
- [34] arXiv:2403.18717 (cross-list from cs.LG) [pdf, other]
-
Title: Semi-Supervised Learning for Deep Causal Generative Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Developing models that can answer questions of the form "How would $x$ change if $y$ had been $z$?" is fundamental for advancing medical image analysis. Training causal generative models that address such counterfactual questions, though, currently requires that all relevant variables have been observed and that corresponding labels are available in training data. However, clinical data may not have complete records for all patients and state of the art causal generative models are unable to take full advantage of this. We thus develop, for the first time, a semi-supervised deep causal generative model that exploits the causal relationships between variables to maximise the use of all available data. We explore this in the setting where each sample is either fully labelled or fully unlabelled, as well as the more clinically realistic case of having different labels missing for each sample. We leverage techniques from causal inference to infer missing values and subsequently generate realistic counterfactuals, even for samples with incomplete labels.
- [35] arXiv:2403.18739 (cross-list from cs.LG) [pdf, other]
-
Title: Usage-Specific Survival Modeling Based on Operational Data and Neural Networks
Comments: 7 pages
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Accurate predictions of when a component will fail are crucial when planning maintenance, and by modeling the distribution of these failure times, survival models have been shown to be particularly useful in this context. The presented methodology is based on conventional neural network-based survival models that are trained using data that is continuously gathered and stored at specific times, called snapshots. An important property of this type of training data is that it can contain more than one snapshot from a specific individual, which means that standard maximum likelihood training cannot be directly applied since the data is not independent. However, the paper shows that if the data is in a specific format where all snapshot times are the same for all individuals, called homogeneously sampled, maximum likelihood training can be applied and produce desirable results. In many cases, the data is not homogeneously sampled, and in this case it is proposed to resample the data to make it homogeneously sampled. How densely the dataset is sampled turns out to be an important parameter; it should be chosen large enough to produce good results, but this also increases the size of the dataset, which makes training slow. To reduce the number of samples needed during training, the paper also proposes a technique to, instead of resampling the dataset once before the training starts, randomly resample the dataset at the start of each epoch during the training. The proposed methodology is evaluated on both a simulated dataset and an experimental dataset of starter battery failures. The results show that if the data is homogeneously sampled the methodology works as intended and produces accurate survival models. The results also show that randomly resampling the dataset on each epoch is an effective way to reduce the size of the training data.
Replacements for Thu, 28 Mar 24
- [36] arXiv:2109.05755 (replaced) [src]
-
Title: IQ: Intrinsic measure for quantifying the heterogeneity in meta-analysis
Comments: With a more comprehensive version with the new title "An alternative measure for quantifying the heterogeneity in meta-analysis", this old version is no longer most suitable to be posted in the arXiv. We hence will submit the new version with a new title as arXiv:2403.16706 and withdraw this outdated version. Thank you very much for your kind consideration
Subjects: Methodology (stat.ME)
- [37] arXiv:2206.03975 (replaced) [pdf, other]
-
Title: Functional linear and single-index models: A unified approach via Gaussian Stein identity
Comments: To appear in Bernoulli Journal
Subjects: Statistics Theory (math.ST)
- [38] arXiv:2207.07020 (replaced) [pdf, other]
-
Title: Estimating sparse direct effects in multivariate regression with the spike-and-slab LASSO
Subjects: Methodology (stat.ME)
- [39] arXiv:2301.02505 (replaced) [pdf, other]
-
Title: Nested Dirichlet models for unsupervised attack pattern detection in honeypot data
Authors: Francesco Sanna Passino, Anastasia Mantziou, Daniyar Ghani, Philip Thiede, Ross Bevington, Nicholas A. Heard
Subjects: Cryptography and Security (cs.CR); Applications (stat.AP)
- [40] arXiv:2303.09817 (replaced) [pdf, other]
-
Title: Interpretable machine learning for time-to-event prediction in medicine and healthcare
Authors: Hubert Baniecki, Bartlomiej Sobieski, Patryk Szatkowski, Przemyslaw Bombinski, Przemyslaw Biecek
Comments: An extended version of an AIME 2023 paper submitted to Artificial Intelligence in Medicine
Journal-ref: Artificial Intelligence in Medicine, vol. 1, pp. 65-74, 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
- [41] arXiv:2305.02158 (replaced) [pdf, other]
-
Title: Shotgun crystal structure prediction using machine-learned formation energies
Authors: Chang Liu (1), Hiromasa Tamaki (2), Tomoyasu Yokoyama (2), Kensuke Wakasugi (2), Satoshi Yotsuhashi (2), Minoru Kusaba (1), Ryo Yoshida (1, 3 and 4) ((1) The Institute of Statistical Mathematics, (2) Panasonic Holdings Corporation, (3) National Institute for Materials Science, (4) The Graduate University for Advanced Studies)
Subjects: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (stat.ML)
- [42] arXiv:2306.10594 (replaced) [pdf, other]
- [43] arXiv:2306.13829 (replaced) [pdf, other]
-
Title: Selective inference using randomized group lasso estimators for general models
Comments: 64 pages, 4 figures, 3 tables
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
- [44] arXiv:2306.15173 (replaced) [pdf, other]
-
Title: Robust propensity score weighting estimation under missing at random
Subjects: Methodology (stat.ME)
- [45] arXiv:2306.15328 (replaced) [pdf, ps, other]
-
Title: Simulating counterfactuals
Subjects: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG); Computation (stat.CO)
- [46] arXiv:2307.00567 (replaced) [pdf, other]
-
Title: A Note on Ising Network Analysis with Missing Data
Subjects: Methodology (stat.ME)
- [47] arXiv:2307.01315 (replaced) [pdf, other]
-
Title: A log-linear model for non-stationary time series of counts
Subjects: Statistics Theory (math.ST)
- [48] arXiv:2307.09713 (replaced) [pdf, other]
-
Title: Non-parametric inference on calibration of predicted risks
Comments: 15 pages (including 2 appendices), 5 figures, 0 tables
Subjects: Methodology (stat.ME); Applications (stat.AP)
- [49] arXiv:2308.11138 (replaced) [pdf, ps, other]
-
Title: NLP-based detection of systematic anomalies among the narratives of consumer complaints
Subjects: Methodology (stat.ME); Computation and Language (cs.CL); Risk Management (q-fin.RM); Machine Learning (stat.ML)
- [50] arXiv:2309.04381 (replaced) [pdf, other]
-
Title: Generalization Bounds: Perspectives from Information Theory and PAC-Bayes
Comments: 228 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
- [51] arXiv:2309.10978 (replaced) [pdf, ps, other]
-
Title: Negative Spillover: A Potential Source of Bias in Pragmatic Clinical Trials
Authors: Sean Mann
Comments: 6.5 pages of main text, 2 figures, 1 table; New version with title change and minor edits to main text
Subjects: Methodology (stat.ME); Quantitative Methods (q-bio.QM)
- [52] arXiv:2310.10900 (replaced) [pdf, other]
-
Title: Stability of Sequential Lateration and of Stress Minimization in the Presence of Noise
Comments: arXiv admin note: substantial text overlap with arXiv:2207.07218
Subjects: Statistics Theory (math.ST); Networking and Internet Architecture (cs.NI); Probability (math.PR)
- [53] arXiv:2310.11471 (replaced) [pdf, other]
-
Title: Modeling lower-truncated and right-censored insurance claims with an extension of the MBBEFD class
Comments: 36 pages
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)
- [54] arXiv:2310.13580 (replaced) [pdf, other]
-
Title: Bayesian Hierarchical Modeling for Bivariate Multiscale Spatial Data with Application to Blood Test Monitoring
Subjects: Methodology (stat.ME); Applications (stat.AP)
- [55] arXiv:2310.16502 (replaced) [pdf, other]
-
Title: Assessing the overall and partial causal well-specification of nonlinear additive noise models
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
- [56] arXiv:2311.06373 (replaced) [pdf, other]
-
Title: Partial Information Decomposition for Continuous Variables based on Shared Exclusions: Analytical Formulation and Estimation
Authors: David A. Ehrlich, Kyle Schick-Poland, Abdullah Makkeh, Felix Lanfermann, Patricia Wollstadt, Michael Wibral
Comments: 32 pages, 15 figures
Subjects: Information Theory (cs.IT); Probability (math.PR); Statistics Theory (math.ST); Computation (stat.CO)
- [57] arXiv:2311.08214 (replaced) [pdf, ps, other]
-
Title: Frequentist Guarantees of Distributed (Non)-Bayesian Inference
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
- [58] arXiv:2312.12558 (replaced) [pdf, other]
-
Title: Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge
Comments: Published in the 38th Annual AAAI Conference on Artificial Intelligence
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
- [59] arXiv:2401.00624 (replaced) [pdf, other]
-
Title: Semi-Confirmatory Factor Analysis for High-Dimensional Data with Interconnected Community Structures
Subjects: Methodology (stat.ME)
- [60] arXiv:2401.16749 (replaced) [pdf, other]
-
Title: Bayesian scalar-on-network regression with applications to brain functional connectivity
Subjects: Methodology (stat.ME)
- [61] arXiv:2402.07868 (replaced) [pdf, ps, other]
-
Title: Nesting Particle Filters for Experimental Design in Dynamical Systems
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
- [62] arXiv:2402.09654 (replaced) [pdf, other]
-
Title: GPT-4's assessment of its performance in a USMLE-based case study
Authors: Uttam Dhakal, Aniket Kumar Singh, Suman Devkota, Yogesh Sapkota, Bishal Lamichhane, Suprinsa Paudyal, Chandra Dhakal
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
- [63] arXiv:2403.07236 (replaced) [pdf, other]
-
Title: Partial Identification of Individual-Level Parameters Using Aggregate Data in a Nonparametric Binary Outcome Model
Authors: Sarah Moon
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
- [64] arXiv:2403.15198 (replaced) [pdf, ps, other]
-
Title: On the Weighted Top-Difference Distance: Axioms, Aggregation, and Approximation
Comments: 64 pages
Subjects: Computer Science and Game Theory (cs.GT); Discrete Mathematics (cs.DM); Theoretical Economics (econ.TH); Methodology (stat.ME)
- [65] arXiv:2403.16121 (replaced) [pdf, other]
-
Title: Log-rank test with coarsened exact matching
Subjects: Statistics Theory (math.ST)
- [66] arXiv:2403.16828 (replaced) [pdf, other]
-
Title: Asymptotics of predictive distributions driven by sample means and variances
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
- [67] arXiv:2403.17481 (replaced) [pdf, ps, other]
- [68] arXiv:2403.17767 (replaced) [pdf, ps, other]
-
Title: Asymptotic Bayes risk of semi-supervised learning with uncertain labeling
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)