We gratefully acknowledge support from
the Simons Foundation and member institutions.

Machine Learning

New submissions

[ total of 96 entries: 1-96 ]
[ showing up to 2000 entries per page: fewer | more ]

New submissions for Tue, 19 Oct 21

[1]  arXiv:2110.08418 [pdf, ps, other]
Title: Nuances in Margin Conditions Determine Gains in Active Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We consider nonparametric classification with smooth regression functions, where it is well known that notions of margin in $E[Y|X]$ determine fast or slow rates in both active and passive learning. Here we elucidate a striking distinction between the two settings. Namely, we show that some seemingly benign nuances in notions of margin -- involving the uniqueness of the Bayes classifier, and which have no apparent effect on rates in passive learning -- determine whether or not any active learner can outperform passive learning rates. In particular, for Audibert-Tsybakov's margin condition (allowing general situations with non-unique Bayes classifiers), no active learner can gain over passive learning in commonly studied settings where the marginal on $X$ is near uniform. Our results thus negate the usual intuition from past literature that active rates should improve over passive rates in nonparametric settings.

[2]  arXiv:2110.08449 [pdf, other]
Title: Adversarial Attacks on Gaussian Process Bandits
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Gaussian processes (GP) are a widely-adopted tool used to sequentially optimize black-box functions, where evaluations are costly and potentially noisy. Recent works on GP bandits have proposed to move beyond random noise and devise algorithms robust to adversarial attacks. In this paper, we study this problem from the attacker's perspective, proposing various adversarial attack methods with differing assumptions on the attacker's strength and prior information. Our goal is to understand adversarial attacks on GP bandits from both a theoretical and practical perspective. We focus primarily on targeted attacks on the popular GP-UCB algorithm and a related elimination-based algorithm, based on adversarially perturbing the function $f$ to produce another function $\tilde{f}$ whose optima are in some region $\mathcal{R}_{\rm target}$. Based on our theoretical analysis, we devise both white-box attacks (known $f$) and black-box attacks (unknown $f$), with the former including a Subtraction attack and Clipping attack, and the latter including an Aggressive subtraction attack. We demonstrate that adversarial attacks on GP bandits can succeed in forcing the algorithm towards $\mathcal{R}_{\rm target}$ even with a low attack budget, and we compare our attacks' performance and efficiency on several real and synthetic functions.

[3]  arXiv:2110.08500 [pdf, other]
Title: On Model Selection Consistency of Lasso for High-Dimensional Ising Models on Tree-like Graphs
Comments: 30 pages, 4 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

We consider the problem of high-dimensional Ising model selection using neighborhood-based least absolute shrinkage and selection operator (Lasso). It is rigorously proved that under some mild coherence conditions on the population covariance matrix of the Ising model, consistent model selection can be achieved with sample sizes $n=\Omega{(d^3\log{p})}$ for any tree-like graph in the paramagnetic phase, where $p$ is the number of variables and $d$ is the maximum node degree. When the same conditions are imposed directly on the sample covariance matrices, it is shown that a reduced sample size $n=\Omega{(d^2\log{p})}$ suffices. The obtained sufficient conditions for consistent model selection with Lasso are the same in the scaling of the sample complexity as that of $\ell_1$-regularized logistic regression. Given the popularity and efficiency of Lasso, our rigorous analysis provides a theoretical backing for its practical use in Ising model selection.

[4]  arXiv:2110.08505 [pdf, other]
Title: Mode and Ridge Estimation in Euclidean and Directional Product Spaces: A Mean Shift Approach
Comments: 51 pages, 10 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

The set of local modes and the ridge lines estimated from a dataset are important summary characteristics of the data-generating distribution. In this work, we consider estimating the local modes and ridges from point cloud data in a product space with two or more Euclidean/directional metric spaces. Specifically, we generalize the well-known (subspace constrained) mean shift algorithm to the product space setting and illuminate some pitfalls in such generalization. We derive the algorithmic convergence of the proposed method, provide practical guidelines on the implementation, and demonstrate its effectiveness on both simulated and real datasets.

[5]  arXiv:2110.08676 [pdf, other]
Title: Noise-Augmented Privacy-Preserving Empirical Risk Minimization with Dual-purpose Regularizer and Privacy Budget Retrieval and Recycling
Authors: Yinan Li, Fang Liu
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

We propose Noise-Augmented Privacy-Preserving Empirical Risk Minimization (NAPP-ERM) that solves ERM with differential privacy guarantees. Existing privacy-preserving ERM approaches may be subject to over-regularization with the employment of an l2 term to achieve strong convexity on top of the target regularization. NAPP-ERM improves over the current approaches and mitigates over-regularization by iteratively realizing target regularization through appropriately designed augmented data and delivering strong convexity via a single adaptively weighted dual-purpose l2 regularizer. When the target regularization is for variable selection, we propose a new regularizer that achieves both privacy and sparsity guarantees simultaneously. Finally, we propose a strategy to retrieve privacy budget when the strong convexity requirement is met, which can be returned to users such that the DP of ERM is guaranteed at a lower privacy cost than originally planned, or be recycled to the ERM optimization procedure to reduce the injected DP noise and improve the utility of DP-ERM. From an implementation perspective, NAPP-ERM can be achieved by optimizing a non-perturbed object function given noise-augmented data and can thus leverage existing tools for non-private ERM optimization. We illustrate through extensive experiments the mitigation effect of the over-regularization and private budget retrieval by NAPP-ERM on variable selection and prediction.

[6]  arXiv:2110.08884 [pdf, other]
Title: Persuasion by Dimension Reduction
Comments: arXiv admin note: text overlap with arXiv:2102.10909
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); General Economics (econ.GN); Statistics Theory (math.ST); Methodology (stat.ME)

How should an agent (the sender) observing multi-dimensional data (the state vector) persuade another agent to take the desired action? We show that it is always optimal for the sender to perform a (non-linear) dimension reduction by projecting the state vector onto a lower-dimensional object that we call the "optimal information manifold." We characterize geometric properties of this manifold and link them to the sender's preferences. Optimal policy splits information into "good" and "bad" components. When the sender's marginal utility is linear, revealing the full magnitude of good information is always optimal. In contrast, with concave marginal utility, optimal information design conceals the extreme realizations of good information and only reveals its direction (sign). We illustrate these effects by explicitly solving several multi-dimensional Bayesian persuasion problems.

[7]  arXiv:2110.08936 [pdf, ps, other]
Title: Rejoinder: Learning Optimal Distributionally Robust Individualized Treatment Rules
Journal-ref: Journal of the American Statistical Association, 116:534, 699-707 (2021)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We thank the opportunity offered by editors for this discussion and the discussants for their insightful comments and thoughtful contributions. We also want to congratulate Kallus (2020) for his inspiring work in improving the efficiency of policy learning by retargeting. Motivated from the discussion in Dukes and Vansteelandt (2020), we first point out interesting connections and distinctions between our work and Kallus (2020) in Section 1. In particular, the assumptions and sources of variation for consideration in these two papers lead to different research problems with different scopes and focuses. In Section 2, following the discussions in Li et al. (2020); Liang and Zhao (2020), we also consider the efficient policy evaluation problem when we have some data from the testing distribution available at the training stage. We show that under the assumption that the sample sizes from training and testing are growing in the same order, efficient value function estimates can deliver competitive performance. We further show some connections of these estimates with existing literature. However, when the growth of testing sample size available for training is in a slower order, efficient value function estimates may not perform well anymore. In contrast, the requirement of the testing sample size for DRITR is not as strong as that of efficient policy evaluation using the combined data. Finally, we highlight the general applicability and usefulness of DRITR in Section 3.

[8]  arXiv:2110.08989 [pdf, other]
Title: Valid and Exact Statistical Inference for Multi-dimensional Multiple Change-Points by Selective Inference
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

In this paper, we study statistical inference of change-points (CPs) in multi-dimensional sequence. In CP detection from a multi-dimensional sequence, it is often desirable not only to detect the location, but also to identify the subset of the components in which the change occurs. Several algorithms have been proposed for such problems, but no valid exact inference method has been established to evaluate the statistical reliability of the detected locations and components. In this study, we propose a method that can guarantee the statistical reliability of both the location and the components of the detected changes. We demonstrate the effectiveness of the proposed method by applying it to the problems of genomic abnormality identification and human behavior analysis.

[9]  arXiv:2110.09167 [pdf, other]
Title: RKHS-SHAP: Shapley Values for Kernel Methods
Comments: 11 pages, 4 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Feature attribution for kernel methods is often heuristic and not individualised for each prediction. To address this, we turn to the concept of Shapley values, a coalition game theoretical framework that has previously been applied to different machine learning model interpretation tasks, such as linear models, tree ensembles and deep networks. By analysing Shapley values from a functional perspective, we propose \textsc{RKHS-SHAP}, an attribution method for kernel machines that can efficiently compute both \emph{Interventional} and \emph{Observational Shapley values} using kernel mean embeddings of distributions. We show theoretically that our method is robust with respect to local perturbations - a key yet often overlooked desideratum for interpretability. Further, we propose \emph{Shapley regulariser}, applicable to a general empirical risk minimisation framework, allowing learning while controlling the level of specific feature's contributions to the model. We demonstrate that the Shapley regulariser enables learning which is robust to covariate shift of a given feature and fair learning which controls the Shapley values of sensitive features.

[10]  arXiv:2110.09360 [pdf, other]
Title: Prediction of liquid fuel properties using machine learning models with Gaussian processes and probabilistic conditional generative learning
Comments: 22 pages, 13 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Accurate determination of fuel properties of complex mixtures over a wide range of pressure and temperature conditions is essential to utilizing alternative fuels. The present work aims to construct cheap-to-compute machine learning (ML) models to act as closure equations for predicting the physical properties of alternative fuels. Those models can be trained using the database from MD simulations and/or experimental measurements in a data-fusion-fidelity approach. Here, Gaussian Process (GP) and probabilistic generative models are adopted. GP is a popular non-parametric Bayesian approach to build surrogate models mainly due to its capacity to handle the aleatory and epistemic uncertainties. Generative models have shown the ability of deep neural networks employed with the same intent. In this work, ML analysis is focused on a particular property, the fuel density, but it can also be extended to other physicochemical properties. This study explores the versatility of the ML models to handle multi-fidelity data. The results show that ML models can predict accurately the fuel properties of a wide range of pressure and temperature conditions.

[11]  arXiv:2110.09361 [pdf, other]
Title: Efficient Exploration in Binary and Preferential Bayesian Optimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

Bayesian optimization (BO) is an effective approach to optimize expensive black-box functions, that seeks to trade-off between exploitation (selecting parameters where the maximum is likely) and exploration (selecting parameters where we are uncertain about the objective function). In many real-world situations, direct measurements of the objective function are not possible, and only binary measurements such as success/failure or pairwise comparisons are available. To perform efficient exploration in this setting, we show that it is important for BO algorithms to distinguish between different types of uncertainty: epistemic uncertainty, about the unknown objective function, and aleatoric uncertainty, which comes from noisy observations and cannot be reduced. In effect, only the former is important for efficient exploration. Based on this, we propose several new acquisition functions that outperform state-of-the-art heuristics in binary and preferential BO, while being fast to compute and easy to implement. We then generalize these acquisition rules to batch learning, where multiple queries are performed simultaneously.

Cross-lists for Tue, 19 Oct 21

[12]  arXiv:2008.08342 (cross-list from cond-mat.dis-nn) [pdf, other]
Title: Structure Learning in Inverse Ising Problems Using $\ell_2$-Regularized Linear Estimator
Comments: 35 pages, 8 figures
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Machine Learning (stat.ML)

The inference performance of the pseudolikelihood method is discussed in the framework of the inverse Ising problem when the $\ell_2$-regularized (ridge) linear regression is adopted. This setup is introduced for theoretically investigating the situation where the data generation model is different from the inference one, namely the model mismatch situation. In the teacher-student scenario under the assumption that the teacher couplings are sparse, the analysis is conducted using the replica and cavity methods, with a special focus on whether the presence/absence of teacher couplings is correctly inferred or not. The result indicates that despite the model mismatch, one can perfectly identify the network structure using naive linear regression without regularization when the number of spins $N$ is smaller than the dataset size $M$, in the thermodynamic limit $N\to \infty$. Further, to access the underdetermined region $M < N$, we examine the effect of the $\ell_2$ regularization, and find that biases appear in all the coupling estimates, preventing the perfect identification of the network structure. We, however, find that the biases are shown to decay exponentially fast as the distance from the center spin chosen in the pseudolikelihood method grows. Based on this finding, we propose a two-stage estimator: In the first stage, the ridge regression is used and the estimates are pruned by a relatively small threshold; in the second stage the naive linear regression is conducted only on the remaining couplings, and the resultant estimates are again pruned by another relatively large threshold. This estimator with the appropriate regularization coefficient and thresholds is shown to achieve the perfect identification of the network structure even in $0<M/N<1$. Results of extensive numerical experiments support these findings.

[13]  arXiv:2110.08363 (cross-list from stat.AP) [pdf, other]
Title: Spatio-temporal extreme event modeling of terror insurgencies
Subjects: Applications (stat.AP); Machine Learning (stat.ML)

Extreme events with potential deadly outcomes, such as those organized by terror groups, are highly unpredictable in nature and an imminent threat to society. In particular, quantifying the likelihood of a terror attack occurring in an arbitrary space-time region and its relative societal risk, would facilitate informed measures that would strengthen national security. This paper introduces a novel self-exciting marked spatio-temporal model for attacks whose inhomogeneous baseline intensity is written as a function of covariates. Its triggering intensity is succinctly modeled with a Gaussian Process prior distribution to flexibly capture intricate spatio-temporal dependencies between an arbitrary attack and previous terror events. By inferring the parameters of this model, we highlight specific space-time areas in which attacks are likely to occur. Furthermore, by measuring the outcome of an attack in terms of the number of casualties it produces, we introduce a novel mixture distribution for the number of casualties. This distribution flexibly handles low and high number of casualties and the discrete nature of the data through a {\it Generalized ZipF} distribution. We rely on a customized Markov chain Monte Carlo (MCMC) method to estimate the model parameters. We illustrate the methodology with data from the open source Global Terrorism Database (GTD) that correspond to attacks in Afghanistan from 2013-2018. We show that our model is able to predict the intensity of future attacks for 2019-2021 while considering various covariates of interest such as population density, number of regional languages spoken, and the density of population supporting the opposing government.

[14]  arXiv:2110.08577 (cross-list from math.OC) [pdf, other]
Title: Nys-Curve: Nyström-Approximated Curvature for Stochastic Optimization
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

The quasi-Newton methods generally provide curvature information by approximating the Hessian using the secant equation. However, the secant equation becomes insipid in approximating the Newton step owing to its use of the first-order derivatives. In this study, we propose an approximate Newton step-based stochastic optimization algorithm for large-scale empirical risk minimization of convex functions with linear convergence rates. Specifically, we compute a partial column Hessian of size ($d\times k$) with $k\ll d$ randomly selected variables, then use the \textit{Nystr\"om method} to better approximate the full Hessian matrix. To further reduce the computational complexity per iteration, we directly compute the update step ($\Delta\boldsymbol{w}$) without computing and storing the full Hessian or its inverse. Furthermore, to address large-scale scenarios in which even computing a partial Hessian may require significant time, we used distribution-preserving (DP) sub-sampling to compute a partial Hessian. The DP sub-sampling generates $p$ sub-samples with similar first and second-order distribution statistics and selects a single sub-sample at each epoch in a round-robin manner to compute the partial Hessian. We integrate our approximated Hessian with stochastic gradient descent and stochastic variance-reduced gradients to solve the logistic regression problem. The numerical experiments show that the proposed approach was able to obtain a better approximation of Newton\textquotesingle s method with performance competitive with the state-of-the-art first-order and the stochastic quasi-Newton methods.

[15]  arXiv:2110.08607 (cross-list from cs.LG) [pdf, other]
Title: Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Chaotic Dynamics (nlin.CD); Machine Learning (stat.ML)

In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework is especially targeted to the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where it is typically intractable to perform exact inference of latent variables. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space is often prone to lack physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modeling framework for unsupervised learning and identification for nonlinear dynamical systems. Specifically, the transition process can be modeled as a physics-based model enhanced with an additive neural network component, which aims to learn the discrepancy between the physics-based model and the actual dynamical system being monitored. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential to generalization and prediction capabilities.

[16]  arXiv:2110.08627 (cross-list from cs.LG) [pdf, other]
Title: On the Pareto Frontier of Regret Minimization and Best Arm Identification in Stochastic Bandits
Comments: 27 pages, 8 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)

We study the Pareto frontier of two archetypal objectives in stochastic bandits, namely, regret minimization (RM) and best arm identification (BAI) with a fixed horizon. It is folklore that the balance between exploitation and exploration is crucial for both RM and BAI, but exploration is more critical in achieving the optimal performance for the latter objective. To make this precise, we first design and analyze the BoBW-lil'UCB$({\gamma})$ algorithm, which achieves order-wise optimal performance for RM or BAI under different values of ${\gamma}$. Complementarily, we show that no algorithm can simultaneously perform optimally for both the RM and BAI objectives. More precisely, we establish non-trivial lower bounds on the regret achievable by any algorithm with a given BAI failure probability. This analysis shows that in some regimes BoBW-lil'UCB$({\gamma})$ achieves Pareto-optimality up to constant or small terms. Numerical experiments further demonstrate that when applied to difficult instances, BoBW-lil'UCB outperforms a close competitor UCB$_{\alpha}$ (Degenne et al., 2019), which is designed for RM and BAI with a fixed confidence.

[17]  arXiv:2110.08634 (cross-list from cs.SD) [pdf, other]
Title: Towards Robust Waveform-Based Acoustic Models
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)

We propose an approach for learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. Our approach is an instance of vicinal risk minimization, which aims to improve risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We characterize the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus our evaluation on the waveform-based setting. Our empirical results show that the proposed approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances (i.e., optimal vicinal densities).

[18]  arXiv:2110.08678 (cross-list from cs.LG) [pdf, other]
Title: Transformer with a Mixture of Gaussian Keys
Comments: 21 pages, 8 figures, 4 tables
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

Multi-head attention is a driving force behind state-of-the-art transformers which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires less FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to use with linear attentions. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmark, Transformer-MGKs with 4 heads attain comparable or better performance to the baseline transformers with 8 heads.

[19]  arXiv:2110.08691 (cross-list from cs.DS) [pdf, ps, other]
Title: Terminal Embeddings in Sublinear Time
Comments: Accepted to FOCS 2021
Subjects: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Machine Learning (cs.LG); Machine Learning (stat.ML)

Recently (Elkin, Filtser, Neiman 2017) introduced the concept of a {\it terminal embedding} from one metric space $(X,d_X)$ to another $(Y,d_Y)$ with a set of designated terminals $T\subset X$. Such an embedding $f$ is said to have distortion $\rho\ge 1$ if $\rho$ is the smallest value such that there exists a constant $C>0$ satisfying
\begin{equation*}
\forall x\in T\ \forall q\in X,\ C d_X(x, q) \le d_Y(f(x), f(q)) \le C \rho d_X(x, q) .
\end{equation*}
In the case that $X,Y$ are both Euclidean metrics with $Y$ being $m$-dimensional, recently (Narayanan, Nelson 2019), following work of (Mahabadi, Makarychev, Makarychev, Razenshteyn 2018), showed that distortion $1+\epsilon$ is achievable via such a terminal embedding with $m = O(\epsilon^{-2}\log n)$ for $n := |T|$. This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within $T$ and not to $T$ from the rest of space. The downside is that evaluating the embedding on some $q\in \mathbb{R}^d$ required solving a semidefinite program with $\Theta(n)$ constraints in $m$ variables and thus required some superlinear $\mathrm{poly}(n)$ runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process $T$ to obtain an almost linear-space data structure that supports computing the terminal embedding image of any $q\in\mathbb{R}^d$ in sublinear time $n^{1-\Theta(\epsilon^2)+o(1)} + dn^{o(1)}$. To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.

[20]  arXiv:2110.08693 (cross-list from cs.LG) [pdf, other]
Title: On the Statistical Analysis of Complex Tree-shaped 3D Objects
Subjects: Machine Learning (cs.LG); Computational Geometry (cs.CG); Graphics (cs.GR); Machine Learning (stat.ML)

How can one analyze detailed 3D biological objects, such as neurons and botanical trees, that exhibit complex geometrical and topological variation? In this paper, we develop a novel mathematical framework for representing, comparing, and computing geodesic deformations between the shapes of such tree-like 3D objects. A hierarchical organization of subtrees characterizes these objects -- each subtree has the main branch with some side branches attached -- and one needs to match these structures across objects for meaningful comparisons. We propose a novel representation that extends the Square-Root Velocity Function (SRVF), initially developed for Euclidean curves, to tree-shaped 3D objects. We then define a new metric that quantifies the bending, stretching, and branch sliding needed to deform one tree-shaped object into the other. Compared to the current metrics, such as the Quotient Euclidean Distance (QED) and the Tree Edit Distance (TED), the proposed representation and metric capture the full elasticity of the branches (i.e., bending and stretching) as well as the topological variations (i.e., branch death/birth and sliding). It completely avoids the shrinkage that results from the edge collapse and node split operations of the QED and TED metrics. We demonstrate the utility of this framework in comparing, matching, and computing geodesics between biological objects such as neurons and botanical trees. The framework is also applied to various shape analysis tasks: (i) symmetry analysis and symmetrization of tree-shaped 3D objects, (ii) computing summary statistics (means and modes of variations) of populations of tree-shaped 3D objects, (iii) fitting parametric probability distributions to such populations, and (iv) finally synthesizing novel tree-shaped 3D objects through random sampling from estimated probability distributions.

[21]  arXiv:2110.08695 (cross-list from cs.LG) [pdf, other]
Title: Towards Instance-Optimal Offline Reinforcement Learning with Pessimism
Comments: NeurIPS, 2021
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using the data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for finite-horizon MDPs. Prior works study this problem based on different data-coverage assumptions, and their learning guarantees are expressed by the covering coefficients which lack the explicit characterization of system quantities. In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive the suboptimality upper bound that nearly matches \[ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). \] In complementary, we also prove a per-instance information-theoretical lower bound under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Different from the previous minimax lower bounds, the per-instance lower bound (via local minimaxity) is a much stronger criterion as it applies to individual instances separately. Here $\pi^\star$ is a optimal policy, $\mu$ is the behavior policy and $d_h^\mu$ is the marginal state-action probability. We call the above equation the intrinsic offline reinforcement learning bound since it directly implies all the existing optimal results: minimax rate under uniform data-coverage assumption, horizon-free setting, single policy concentrability, and the tight problem-dependent results. Later, we extend the result to the assumption-free regime (where we make no assumption on $ \mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.

[22]  arXiv:2110.08710 (cross-list from cs.LG) [pdf, ps, other]
Title: NeuralArTS: Structuring Neural Architecture Search with Type Theory
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL); Machine Learning (stat.ML)

Neural Architecture Search (NAS) algorithms automate the task of finding optimal deep learning architectures given an initial search space of possible operations. Developing these search spaces is usually a manual affair with pre-optimized search spaces being more efficient, rather than searching from scratch. In this paper we present a new framework called Neural Architecture Type System (NeuralArTS) that categorizes the infinite set of network operations in a structured type system. We further demonstrate how NeuralArTS can be applied to convolutional layers and propose several future directions.

[23]  arXiv:2110.08720 (cross-list from cs.LG) [pdf, other]
Title: Centroid Approximation for Bootstrap
Authors: Mao Ye, Qiang Liu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Bootstrap is a principled and powerful frequentist statistical tool for uncertainty quantification. Unfortunately, standard bootstrap methods are computationally intensive due to the need of drawing a large i.i.d. bootstrap sample to approximate the ideal bootstrap distribution; this largely hinders their application in large-scale machine learning, especially deep learning problems. In this work, we propose an efficient method to explicitly \emph{optimize} a small set of high quality "centroid" points to better approximate the ideal bootstrap distribution. We achieve this by minimizing a simple objective function that is asymptotically equivalent to the Wasserstein distance to the ideal bootstrap distribution. This allows us to provide an accurate estimation of uncertainty with a small number of bootstrap centroids, outperforming the naive i.i.d. sampling approach. Empirically, we show that our method can boost the performance of bootstrap in a variety of applications.

[24]  arXiv:2110.08850 (cross-list from physics.soc-ph) [pdf]
Title: Understanding the network formation pattern for better link prediction
Comments: 21 pages, 3 figures, 18 tables, and 29 references
Subjects: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Molecular Networks (q-bio.MN); Machine Learning (stat.ML)

As a classical problem in the field of complex networks, link prediction has attracted much attention from researchers, which is of great significance to help us understand the evolution and dynamic development mechanisms of networks. Although various network type-specific algorithms have been proposed to tackle the link prediction problem, most of them suppose that the network structure is dominated by the Triadic Closure Principle. We still lack an adaptive and comprehensive understanding of network formation patterns for predicting potential links. In addition, it is valuable to investigate how network local information can be better utilized. To this end, we proposed a novel method named Link prediction using Multiple Order Local Information (MOLI) that exploits the local information from the neighbors of different distances, with parameters that can be a prior-driven based on prior knowledge, or data-driven by solving an optimization problem on observed networks. MOLI defined a local network diffusion process via random walks on the graph, resulting in better use of network information. We show that MOLI outperforms the other 11 widely used link prediction algorithms on 11 different types of simulated and real-world networks. We also conclude that there are different patterns of local information utilization for different networks, including social networks, communication networks, biological networks, etc. In particular, the classical common neighbor-based algorithm is not as adaptable to all social networks as it is perceived to be; instead, some of the social networks obey the Quadrilateral Closure Principle which preferentially connects paths of length three.

[25]  arXiv:2110.08871 (cross-list from cs.LG) [pdf, ps, other]
Title: Noise-robust Clustering
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

This paper presents noise-robust clustering techniques in unsupervised machine learning. The uncertainty about the noise, consistency, and other ambiguities can become severe obstacles in data analytics. As a result, data quality, cleansing, management, and governance remain critical disciplines when working with Big Data. With this complexity, it is no longer sufficient to treat data deterministically as in a classical setting, and it becomes meaningful to account for noise distribution and its impact on data sample values. Classical clustering methods group data into "similarity classes" depending on their relative distances or similarities in the underlying space. This paper addressed this problem via the extension of classical $K$-means and $K$-medoids clustering over data distributions (rather than the raw data). This involves measuring distances among distributions using two types of measures: the optimal mass transport (also called Wasserstein distance, denoted $W_2$) and a novel distance measure proposed in this paper, the expected value of random variable distance (denoted ED). The presented distribution-based $K$-means and $K$-medoids algorithms cluster the data distributions first and then assign each raw data to the cluster of data's distribution.

[26]  arXiv:2110.08922 (cross-list from cs.LG) [pdf, other]
Title: Explaining generalization in deep learning: progress and fundamental limits
Comments: arXiv admin note: text overlap with arXiv:1902.04742
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

This dissertation studies a fundamental open challenge in deep learning theory: why do deep networks generalize well even while being overparameterized, unregularized and fitting the training data to zero error?
In the first part of the thesis, we will empirically study how training deep networks via stochastic gradient descent implicitly controls the networks' capacity. Subsequently, to show how this leads to better generalization, we will derive {\em data-dependent} {\em uniform-convergence-based} generalization bounds with improved dependencies on the parameter count.
Uniform convergence has in fact been the most widely used tool in deep learning literature, thanks to its simplicity and generality. Given its popularity, in this thesis, we will also take a step back to identify the fundamental limits of uniform convergence as a tool to explain generalization. In particular, we will show that in some example overparameterized settings, {\em any} uniform convergence bound will provide only a vacuous generalization bound.
With this realization in mind, in the last part of the thesis, we will change course and introduce an {\em empirical} technique to estimate generalization using unlabeled data. Our technique does not rely on any notion of uniform-convergece-based complexity and is remarkably precise. We will theoretically show why our technique enjoys such precision.
We will conclude by discussing how future work could explore novel ways to incorporate distributional assumptions in generalization bounds (such as in the form of unlabeled data) and explore other tools to derive bounds, perhaps by modifying uniform convergence or by developing completely new tools altogether.

[27]  arXiv:2110.08984 (cross-list from cs.LG) [pdf, ps, other]
Title: Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs). In this setting, both the reward function and the transition kernel are linear with respect to the given feature maps and are allowed to vary over time, as long as their respective parameter variations do not exceed certain variation budgets. We propose the $\underline{\text{p}}$eriodically $\underline{\text{r}}$estarted $\underline{\text{o}}$ptimistic $\underline{\text{p}}$olicy $\underline{\text{o}}$ptimization algorithm (PROPO), which is an optimistic policy optimization algorithm with linear function approximation. PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement, which are tailored for policy optimization in a non-stationary environment. In addition, only utilizing the technique of sliding window, we propose a value-iteration algorithm. We establish dynamic upper bounds for the proposed methods and a matching minimax lower bound which shows the (near-) optimality of the proposed methods. To our best knowledge, PROPO is the first provably efficient policy optimization algorithm that handles non-stationarity.

[28]  arXiv:2110.08985 (cross-list from cs.CV) [pdf, other]
Title: StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis
Comments: 24 pages, 19 figures. Project page: this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

We propose StyleNeRF, a 3D-aware generative model for photo-realistic high-resolution image synthesis with high multi-view consistency, which can be trained on unstructured 2D images. Existing approaches either cannot synthesize high-resolution images with fine details or yield noticeable 3D-inconsistent artifacts. In addition, many of them lack control over style attributes and explicit 3D camera poses. StyleNeRF integrates the neural radiance field (NeRF) into a style-based generator to tackle the aforementioned challenges, i.e., improving rendering efficiency and 3D consistency for high-resolution image generation. We perform volume rendering only to produce a low-resolution feature map and progressively apply upsampling in 2D to address the first issue. To mitigate the inconsistencies caused by 2D upsampling, we propose multiple designs, including a better upsampler and a new regularization loss. With these designs, StyleNeRF can synthesize high-resolution images at interactive rates while preserving 3D consistency at high quality. StyleNeRF also enables control of camera poses and different levels of styles, which can generalize to unseen views. It also supports challenging tasks, including zoom-in and-out, style mixing, inversion, and semantic editing.

[29]  arXiv:2110.09006 (cross-list from cs.CV) [pdf, other]
Title: Natural Image Reconstruction from fMRI using Deep Learning: A Survey
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)

With the advent of brain imaging techniques and machine learning tools, much effort has been devoted to building computational models to capture the encoding of visual information in the human brain. One of the most challenging brain decoding tasks is the accurate reconstruction of the perceived natural images from brain activities measured by functional magnetic resonance imaging (fMRI). In this work, we survey the most recent deep learning methods for natural image reconstruction from fMRI. We examine these methods in terms of architectural design, benchmark datasets, and evaluation metrics and present a fair performance evaluation across standardized evaluation metrics. Finally, we discuss the strengths and limitations of existing studies and present potential future directions.

[30]  arXiv:2110.09040 (cross-list from stat.ME) [pdf, ps, other]
Title: A Bayesian approach to multi-task learning with network lasso
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Network lasso is a method for solving a multi-task learning problem through the regularized maximum likelihood method. A characteristic of network lasso is setting a different model for each sample. The relationships among the models are represented by relational coefficients. A crucial issue in network lasso is to provide appropriate values for these relational coefficients. In this paper, we propose a Bayesian approach to solve multi-task learning problems by network lasso. This approach allows us to objectively determine the relational coefficients by Bayesian estimation. The effectiveness of the proposed method is shown in a simulation study and a real data analysis.

[31]  arXiv:2110.09042 (cross-list from math.ST) [pdf, other]
Title: Kernel-based estimation for partially functional linear model: Minimax rates and randomized sketches
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

This paper considers the partially functional linear model (PFLM) where all predictive features consist of a functional covariate and a high dimensional scalar vector. Over an infinite dimensional reproducing kernel Hilbert space, the proposed estimation for PFLM is a least square approach with two mixed regularizations of a function-norm and an $\ell_1$-norm. Our main task in this paper is to establish the minimax rates for PFLM under high dimensional setting, and the optimal minimax rates of estimation is established by using various techniques in empirical process theory for analyzing kernel classes. In addition, we propose an efficient numerical algorithm based on randomized sketches of the kernel matrix. Several numerical experiments are implemented to support our method and optimization strategy.

[32]  arXiv:2110.09140 (cross-list from cs.LG) [pdf, other]
Title: Learning Prototype-oriented Set Representations for Meta-Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Learning from set-structured data is a fundamental problem that has recently attracted increasing attention, where a series of summary networks are introduced to deal with the set input. In fact, many meta-learning problems can be treated as set-input tasks. Most existing summary networks aim to design different architectures for the input set in order to enforce permutation invariance. However, scant attention has been paid to the common cases where different sets in a meta-distribution are closely related and share certain statistical properties. Viewing each set as a distribution over a set of global prototypes, this paper provides a novel optimal transport (OT) based way to improve existing summary networks. To learn the distribution over the global prototypes, we minimize its OT distance to the set empirical distribution over data points, providing a natural unsupervised way to improve the summary network. Since our plug-and-play framework can be applied to many meta-learning problems, we further instantiate it to the cases of few-shot classification and implicit meta generative modeling. Extensive experiments demonstrate that our framework significantly improves the existing summary networks on learning more powerful summary statistics from sets and can be successfully integrated into metric-based few-shot classification and generative modeling applications, providing a promising tool for addressing set-input and meta-learning problems.

[33]  arXiv:2110.09192 (cross-list from cs.LG) [pdf, other]
Title: Learning Optimal Conformal Classifiers
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME); Machine Learning (stat.ML)

Modern deep learning based classifiers show very high accuracy on test data but this does not provide sufficient guarantees for safe deployment, especially in high-stake AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's probability estimates to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. We show that CT outperforms state-of-the-art CP methods for classification by reducing the average confidence set size (inefficiency). Moreover, it allows to "shape" the confidence sets predicted at test time, which is difficult for standard CP. On experiments with several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.

[34]  arXiv:2110.09253 (cross-list from cs.CY) [pdf]
Title: A Sociotechnical View of Algorithmic Fairness
Comments: Accepted at Information Systems Journal
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)

Algorithmic fairness has been framed as a newly emerging technology that mitigates systemic discrimination in automated decision-making, providing opportunities to improve fairness in information systems (IS). However, based on a state-of-the-art literature review, we argue that fairness is an inherently social concept and that technologies for algorithmic fairness should therefore be approached through a sociotechnical lens. We advance the discourse on algorithmic fairness as a sociotechnical phenomenon. Our research objective is to embed AF in the sociotechnical view of IS. Specifically, we elaborate on why outcomes of a system that uses algorithmic means to assure fairness depends on mutual influences between technical and social structures. This perspective can generate new insights that integrate knowledge from both technical fields and social studies. Further, it spurs new directions for IS debates. We contribute as follows: First, we problematize fundamental assumptions in the current discourse on algorithmic fairness based on a systematic analysis of 310 articles. Second, we respond to these assumptions by theorizing algorithmic fairness as a sociotechnical construct. Third, we propose directions for IS researchers to enhance their impacts by pursuing a unique understanding of sociotechnical algorithmic fairness. We call for and undertake a holistic approach to AF. A sociotechnical perspective on algorithmic fairness can yield holistic solutions to systemic biases and discrimination.

[35]  arXiv:2110.09327 (cross-list from cs.LG) [pdf, other]
Title: Self-Supervised Representation Learning: Introduction, Advances and Challenges
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Self-supervised representation learning methods aim to provide powerful deep feature learning without the requirement of large annotated datasets, thus alleviating the annotation bottleneck that is one of the main barriers to practical deployment of deep learning today. These methods have advanced rapidly in recent years, with their efficacy approaching and sometimes surpassing fully supervised pre-training alternatives across a variety of data modalities including image, video, sound, text and graphs. This article introduces this vibrant area including key concepts, the four main families of approach and associated state of the art, and how self-supervised methods are applied to diverse modalities of data. We further discuss practical considerations including workflows, representation transferability, and compute cost. Finally, we survey the major open challenges in the field that provide fertile ground for future work.

[36]  arXiv:2110.09333 (cross-list from math.ST) [pdf, other]
Title: Regression with Missing Data, a Comparison Study of TechniquesBased on Random Forests
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

In this paper we present the practical benefits of a new random forest algorithm to deal withmissing values in the sample. The purpose of this work is to compare the different solutionsto deal with missing values with random forests and describe our new algorithm performanceas well as its algorithmic complexity. A variety of missing value mechanisms (such as MCAR,MAR, MNAR) are considered and simulated. We study the quadratic errors and the bias ofour algorithm and compare it to the most popular missing values random forests algorithms inthe literature. In particular, we compare those techniques for both a regression and predictionpurpose. This work follows a first paper Gomez-Mendez and Joly (2020) on the consistency ofthis new algorithm.

[37]  arXiv:2110.09334 (cross-list from math.OC) [pdf, other]
Title: A portfolio approach to massively parallel Bayesian optimization
Subjects: Optimization and Control (math.OC); Machine Learning (stat.ML)

One way to reduce the time of conducting optimization studies is to evaluate designs in parallel rather than just one-at-a-time. For expensive-to-evaluate black-boxes, batch versions of Bayesian optimization have been proposed. They work by building a surrogate model of the black-box that can be used to select the designs to evaluate efficiently via an infill criterion. Still, with higher levels of parallelization becoming available, the strategies that work for a few tens of parallel evaluations become limiting, in particular due to the complexity of selecting more evaluations. It is even more crucial when the black-box is noisy, necessitating more evaluations as well as repeating experiments. Here we propose a scalable strategy that can keep up with massive batching natively, focused on the exploration/exploitation trade-off and a portfolio allocation. We compare the approach with related methods on deterministic and noisy functions, for mono and multiobjective optimization tasks. These experiments show similar or better performance than existing methods, while being orders of magnitude faster.

[38]  arXiv:2110.09356 (cross-list from cs.LG) [pdf, other]
Title: Towards Federated Bayesian Network Structure Learning with Continuous Optimization
Comments: 16 pages; 5 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Traditionally, Bayesian network structure learning is often carried out at a central site, in which all data is gathered. However, in practice, data may be distributed across different parties (e.g., companies, devices) who intend to collectively learn a Bayesian network, but are not willing to disclose information related to their data owing to privacy or security concerns. In this work, we present a cross-silo federated learning approach to estimate the structure of Bayesian network from data that is horizontally partitioned across different parties. We develop a distributed structure learning method based on continuous optimization, using the alternating direction method of multipliers (ADMM), such that only the model parameters have to be exchanged during the optimization process. We demonstrate the flexibility of our approach by adopting it for both linear and nonlinear cases. Experimental results on synthetic and real datasets show that it achieves an improved performance over the other methods, especially when there is a relatively large number of clients and each has a limited sample size.

[39]  arXiv:2110.09443 (cross-list from cs.LG) [pdf, other]
Title: Beltrami Flow and Neural Diffusion on Graphs
Comments: 21 pages, 5 figures. Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) 2021
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

We propose a novel class of graph neural networks based on the discretised Beltrami flow, a non-Euclidean diffusion PDE. In our model, node features are supplemented with positional encodings derived from the graph topology and jointly evolved by the Beltrami flow, producing simultaneously continuous feature learning and topology evolution. The resulting model generalises many popular graph neural networks and achieves state-of-the-art results on several benchmarks.

[40]  arXiv:2110.09468 (cross-list from cs.LG) [pdf, other]
Title: Improving Robustness using Generated Data
Comments: Accepted at NeurIPS 2021
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Recent work argues that robust training requires substantially larger datasets than those required for standard classification. On CIFAR-10 and CIFAR-100, this translates into a sizable robust-accuracy gap between models trained solely on data from the original training set and those trained with additional data extracted from the "80 Million Tiny Images" dataset (TI-80M). In this paper, we explore how generative models trained solely on the original training set can be leveraged to artificially increase the size of the original training set and improve adversarial robustness to $\ell_p$ norm-bounded perturbations. We identify the sufficient conditions under which incorporating additional generated data can improve robustness, and demonstrate that it is possible to significantly reduce the robust-accuracy gap to models trained with additional real data. Surprisingly, we even show that even the addition of non-realistic random data (generated by Gaussian sampling) can improve robustness. We evaluate our approach on CIFAR-10, CIFAR-100, SVHN and TinyImageNet against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $\epsilon = 8/255$ and $\epsilon = 128/255$, respectively. We show large absolute improvements in robust accuracy compared to previous state-of-the-art methods. Against $\ell_\infty$ norm-bounded perturbations of size $\epsilon = 8/255$, our models achieve 66.10% and 33.49% robust accuracy on CIFAR-10 and CIFAR-100, respectively (improving upon the state-of-the-art by +8.96% and +3.29%). Against $\ell_2$ norm-bounded perturbations of size $\epsilon = 128/255$, our model achieves 78.31% on CIFAR-10 (+3.81%). These results beat most prior works that use external data.

[41]  arXiv:2110.09476 (cross-list from cs.LG) [pdf, other]
Title: Recovery Guarantees for Kernel-based Clustering under Non-parametric Mixture Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Despite the ubiquity of kernel-based clustering, surprisingly few statistical guarantees exist beyond settings that consider strong structural assumptions on the data generation process. In this work, we take a step towards bridging this gap by studying the statistical performance of kernel-based clustering algorithms under non-parametric mixture models. We provide necessary and sufficient separability conditions under which these algorithms can consistently recover the underlying true clustering. Our analysis provides guarantees for kernel clustering approaches without structural assumptions on the form of the component distributions. Additionally, we establish a key equivalence between kernel-based data-clustering and kernel density-based clustering. This enables us to provide consistency guarantees for kernel-based estimators of non-parametric mixture models. Along with theoretical implications, this connection could have practical implications, including in the systematic choice of the bandwidth of the Gaussian kernel in the context of clustering.

[42]  arXiv:2110.09502 (cross-list from math.ST) [pdf, other]
Title: Minimum $\ell_{1}$-norm interpolators: Precise asymptotics and multiple descent
Authors: Yue Li, Yuting Wei
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)

An evolving line of machine learning works observe empirical evidence that suggests interpolating estimators -- the ones that achieve zero training error -- may not necessarily be harmful. This paper pursues theoretical understanding for an important type of interpolators: the minimum $\ell_{1}$-norm interpolator, which is motivated by the observation that several learning algorithms favor low $\ell_1$-norm solutions in the over-parameterized regime. Concretely, we consider the noisy sparse regression model under Gaussian design, focusing on linear sparsity and high-dimensional asymptotics (so that both the number of features and the sparsity level scale proportionally with the sample size).
We observe, and provide rigorous theoretical justification for, a curious multi-descent phenomenon; that is, the generalization risk of the minimum $\ell_1$-norm interpolator undergoes multiple (and possibly more than two) phases of descent and ascent as one increases the model capacity. This phenomenon stems from the special structure of the minimum $\ell_1$-norm interpolator as well as the delicate interplay between the over-parameterized ratio and the sparsity, thus unveiling a fundamental distinction in geometry from the minimum $\ell_2$-norm interpolator. Our finding is built upon an exact characterization of the risk behavior, which is governed by a system of two non-linear equations with two unknowns.

[43]  arXiv:2110.09507 (cross-list from cs.LG) [pdf, other]
Title: Provable Hierarchy-Based Meta-Reinforcement Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Hierarchical reinforcement learning (HRL) has seen widespread interest as an approach to tractable learning of complex modular behaviors. However, existing work either assume access to expert-constructed hierarchies, or use hierarchy-learning heuristics with no provable guarantees. To address this gap, we analyze HRL in the meta-RL setting, where a learner learns latent hierarchical structure during meta-training for use in a downstream task. We consider a tabular setting where natural hierarchical structure is embedded in the transition dynamics. Analogous to supervised meta-learning theory, we provide "diversity conditions" which, together with a tractable optimism-based algorithm, guarantee sample-efficient recovery of this natural hierarchy. Furthermore, we provide regret bounds on a learner using the recovered hierarchy to solve a meta-test task. Our bounds incorporate common notions in HRL literature such as temporal and state/action abstractions, suggesting that our setting and analysis capture important features of HRL in practice.

[44]  arXiv:2110.09514 (cross-list from cs.LG) [pdf, other]
Title: Discovering and Achieving Goals via World Models
Comments: NeurIPS 2021. First two authors contributed equally. Website at this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Machine Learning (stat.ML)

How can artificial agents learn to solve many diverse tasks in complex visual environments in the absence of any supervision? We decompose this question into two problems: discovering new goals and learning to reliably achieve them. We introduce Latent Explorer Achiever (LEXA), a unified solution to these that learns a world model from image inputs and uses it to train an explorer and an achiever policy from imagined rollouts. Unlike prior methods that explore by reaching previously visited states, the explorer plans to discover unseen surprising states through foresight, which are then used as diverse targets for the achiever to practice. After the unsupervised phase, LEXA solves tasks specified as goal images zero-shot without any additional learning. LEXA substantially outperforms previous approaches to unsupervised goal-reaching, both on prior benchmarks and on a new challenging benchmark with a total of 40 test tasks spanning across four standard robotic manipulation and locomotion domains. LEXA further achieves goals that require interacting with multiple objects in sequence. Finally, to demonstrate the scalability and generality of LEXA, we train a single general agent across four distinct environments. Code and videos at https://orybkin.github.io/lexa/

Replacements for Tue, 19 Oct 21

[45]  arXiv:1901.08057 (replaced) [pdf, ps, other]
Title: Large dimensional analysis of general margin based classification methods
Comments: 33 pages, 5 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
[46]  arXiv:2007.02794 (replaced) [pdf, other]
Title: Efficient Connected and Automated Driving Systemwith Multi-agent Graph Reinforcement Learning
Comments: the paper is not even ready
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[47]  arXiv:2007.04803 (replaced) [pdf, other]
Title: A Global Stochastic Optimization Particle Filter Algorithm
Comments: 61 pages, 4 figures
Subjects: Machine Learning (stat.ML); Statistics Theory (math.ST); Computation (stat.CO)
[48]  arXiv:2007.14052 (replaced) [pdf, other]
Title: Multioutput Gaussian Processes with Functional Data: A Study on Coastal Flood Hazard Assessment
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
[49]  arXiv:2007.14861 (replaced) [pdf, ps, other]
Title: Efficient Sparse Secure Aggregation for Federated Learning
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
[50]  arXiv:2008.13443 (replaced) [pdf, other]
Title: On the Quality Requirements of Demand Prediction for Dynamic Public Transport
Comments: 26 pages, 9 tables, 6 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
[51]  arXiv:2010.00373 (replaced) [pdf, other]
Title: Task Agnostic Continual Learning Using Online Variational Bayes with Fixed-Point Updates
Comments: The arXiv paper "Task Agnostic Continual Learning Using Online Variational Bayes" is a preliminary pre-print of this paper. The main differences between the versions are: 1. We develop new algorithmic framework (FOO-VB). 2. We add multivariate Gaussian and matrix variate Gaussian versions of the algorithm. 3. We demonstrate the new algorithm performance in task agnostic scenarios
Journal-ref: Neural Comput 2021; 33 (11)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[52]  arXiv:2105.03425 (replaced) [pdf, other]
Title: Kernel Two-Sample Tests for Manifold Data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
[53]  arXiv:2105.08532 (replaced) [pdf, other]
Title: Robust Learning in Heterogeneous Contexts
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[54]  arXiv:2106.03227 (replaced) [pdf, other]
Title: Neural Tangent Kernel Maximum Mean Discrepancy
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
[55]  arXiv:2106.03762 (replaced) [pdf, other]
Title: Frustratingly Easy Uncertainty Estimation for Distribution Shift
Comments: 17 pages, 4 Tables, 9 Figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[56]  arXiv:2106.05565 (replaced) [pdf, other]
Title: Identifiability of interaction kernels in mean-field equations of interacting particles
Authors: Quanjun Lang, Fei Lu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
[57]  arXiv:2106.09215 (replaced) [pdf, other]
Title: Optimum-statistical Collaboration Towards General and Efficient Black-box Optimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[58]  arXiv:2106.15358 (replaced) [pdf, other]
Title: Towards Sample-Optimal Compressive Phase Retrieval with Sparse and Generative Priors
Comments: Accepted to NeurIPS 2021
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
[59]  arXiv:2107.10884 (replaced) [pdf, other]
Title: Structured second-order methods via natural gradient descent
Comments: Fixed some typos. ICML workshop paper. A short version of arXiv:2102.07405 with a focus on optimization tasks
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[60]  arXiv:2109.05578 (replaced) [pdf, other]
Title: Kernel PCA with the Nyström method
Authors: Fredrik Hallgren
Comments: 44 pages, 6 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
[61]  arXiv:2109.05583 (replaced) [pdf, ps, other]
Title: Automatic Componentwise Boosting: An Interpretable AutoML System
Comments: 6 pages, 4 figures, ECML-PKDD Workshop on Automating Data Science 2021
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[62]  arXiv:2110.00629 (replaced) [pdf, other]
Title: Factored couplings in multi-marginal optimal transport via difference of convex programming
Comments: Fix typo and correct the corollary 3.3
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
[63]  arXiv:2110.01593 (replaced) [pdf, other]
Title: Generalized Kernel Thinning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
[64]  arXiv:2110.06021 (replaced) [pdf, other]
Title: Embedded-model flows: Combining the inductive biases of model-free deep learning and explicit probabilistic modeling
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[65]  arXiv:1712.01145 (replaced) [pdf, other]
Title: Learning Fast and Slow: PROPEDEUTICA for Real-time Malware Detection
Comments: 12 pages, 4 figures. This paper has been accepted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
[66]  arXiv:1902.04742 (replaced) [pdf, other]
Title: Uniform convergence may be unable to explain generalization in deep learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[67]  arXiv:1910.09714 (replaced) [pdf, other]
Title: Smoothness-Adaptive Contextual Bandits
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[68]  arXiv:1911.02319 (replaced) [pdf, other]
Title: Improving reinforcement learning algorithms: towards optimal learning rate policies
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
[69]  arXiv:2003.00660 (replaced) [pdf, ps, other]
Title: Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
[70]  arXiv:2003.06566 (replaced) [pdf, other]
Title: On the benefits of defining vicinal distributions in latent space
Comments: Accepted at Elsevier Pattern Recognition Letters (2021), Best Paper Award at CVPR 2021 Workshop on Adversarial Machine Learning in Real-World Computer Vision (AML-CV), Also accepted at ICLR 2021 Workshops on Robust-Reliable Machine Learning (Oral) and Generalization beyond the training distribution (Abstract)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[71]  arXiv:2005.03566 (replaced) [pdf, other]
Title: Noisy Differentiable Architecture Search
Comments: BMVC 2021
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[72]  arXiv:2006.05842 (replaced) [pdf, other]
Title: The Emergence of Individuality
Comments: The extended version of ICML 2021 paper
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
[73]  arXiv:2007.00823 (replaced) [pdf, other]
Title: Dropout as a Regularizer of Interaction Effects
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[74]  arXiv:2007.03408 (replaced) [pdf, other]
Title: A Generative Model for Texture Synthesis based on Optimal Transport between Feature Distributions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[75]  arXiv:2009.06087 (replaced) [pdf, other]
Title: Neural Networks Enhancement with Logical Knowledge
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)
[76]  arXiv:2010.01777 (replaced) [pdf, other]
Title: A Unified View on Graph Neural Networks as Graph Signal Denoising
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[77]  arXiv:2011.05348 (replaced) [pdf, other]
Title: SALR: Sharpness-aware Learning Rate Scheduler for Improved Generalization
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[78]  arXiv:2101.12353 (replaced) [pdf, other]
Title: On the capacity of deep generative networks for approximating distributions
Subjects: Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)
[79]  arXiv:2104.07084 (replaced) [pdf, other]
Title: Grouped Variable Selection with Discrete Optimization: Computational and Statistical Perspectives
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
[80]  arXiv:2104.11734 (replaced) [pdf, other]
Title: Exact marginal prior distributions of finite Bayesian neural networks
Comments: 12+9 pages, 4 figures; v3: Accepted as NeurIPS 2021 Spotlight
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
[81]  arXiv:2105.07025 (replaced) [pdf, other]
Title: Minimal Cycle Representatives in Persistent Homology using Linear Programming: an Empirical Study with User's Guide
Subjects: Algebraic Topology (math.AT); Computational Geometry (cs.CG); Machine Learning (stat.ML)
[82]  arXiv:2105.08024 (replaced) [pdf, other]
Title: Sample-Efficient Reinforcement Learning Is Feasible for Linearly Realizable MDPs with Limited Revisiting
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
[83]  arXiv:2106.05232 (replaced) [pdf, ps, other]
Title: Realizing GANs via a Tunable Loss Function
Comments: Extended version of a paper accepted to ITW 2021. 8 pages, 2 figures
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
[84]  arXiv:2106.06134 (replaced) [pdf, other]
Title: Is Homophily a Necessity for Graph Neural Networks?
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[85]  arXiv:2106.10065 (replaced) [pdf, other]
Title: Being a Bit Frequentist Improves Bayesian Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[86]  arXiv:2106.12423 (replaced) [pdf, other]
Title: Alias-Free Generative Adversarial Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
[87]  arXiv:2106.13423 (replaced) [pdf, other]
Title: Federated Graph Classification over Non-IID Graphs
Comments: Accepted to NeurIPS 2021
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
[88]  arXiv:2107.00520 (replaced) [pdf, other]
Title: Predictive Modeling in the Presence of Nuisance-Induced Spurious Correlations
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[89]  arXiv:2107.00758 (replaced) [pdf, other]
Title: The Spotlight: A General Method for Discovering Systematic Errors in Deep Learning Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[90]  arXiv:2107.05686 (replaced) [pdf, other]
Title: The Role of Pretrained Representations for the OOD Generalization of RL Agents
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[91]  arXiv:2108.08987 (replaced) [pdf, other]
Title: Uniformity Testing in the Shuffle Model: Simpler, Better, Faster
Comments: Accepted to the SIAM Symposium on Simplicity in Algorithms (SOSA 2022). Added some details and discussions
Subjects: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Discrete Mathematics (cs.DM); Machine Learning (stat.ML)
[92]  arXiv:2108.09676 (replaced) [pdf, other]
Title: Efficient Gaussian Neural Processes for Regression
Comments: 6 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[93]  arXiv:2108.10566 (replaced) [pdf, other]
Title: sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[94]  arXiv:2109.05675 (replaced) [pdf, other]
Title: Online Unsupervised Learning of Visual Representations and Categories
Comments: Technical report, 28 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[95]  arXiv:2110.07959 (replaced) [pdf, other]
Title: Low-rank Matrix Recovery With Unknown Correspondence
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
[96]  arXiv:2110.08211 (replaced) [pdf, other]
Title: Astronomical source finding services for the CIRASA visual analytic platform
Comments: 16 pages, 6 figures
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computation (stat.CO); Machine Learning (stat.ML)
[ total of 96 entries: 1-96 ]
[ showing up to 2000 entries per page: fewer | more ]

Disable MathJax (What is MathJax?)

Links to: arXiv, form interface, find, stat, recent, 2110, contact, help  (Access key information)