New submissions

New submissions for Fri, 21 Feb 20

[1]  arXiv:2002.08404 [pdf, other]
Title: Implicit Regularization of Random Feature Models
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Random Feature (RF) models are used as efficient parametric approximations of kernel methods. We investigate, by means of random matrix theory, the connection between Gaussian RF models and Kernel Ridge Regression (KRR). For a Gaussian RF model with $P$ features, $N$ data points, and a ridge $\lambda$, we show that the average (i.e. expected) RF predictor is close to a KRR predictor with an effective ridge $\tilde{\lambda}$. We show that $\tilde{\lambda} > \lambda$ and $\tilde{\lambda} \searrow \lambda$ monotonically as $P$ grows, thus revealing the implicit regularization effect of finite RF sampling. We then compare the risk (i.e. test error) of the $\tilde{\lambda}$-KRR predictor with the average risk of the $\lambda$-RF predictor and obtain a precise and explicit bound on their difference. Finally, we empirically find an extremely good agreement between the test errors of the average $\lambda$-RF predictor and $\tilde{\lambda}$-KRR predictor.
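The comparison described in this abstract can be explored numerically. The following is a minimal sketch, not the authors' code: it assumes an RBF kernel, a toy 1-D regression problem, and a particular ridge normalization (the helper names `rbf_kernel`, `krr_predict`, and `average_rf_predict` are hypothetical). Gaussian features are drawn jointly at the train and test points from a Gaussian process with covariance $K$, so that the expected feature Gram matrix equals the kernel matrix.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B (assumed kernel)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def krr_predict(K_train, K_test_train, y, ridge):
    """Kernel ridge regression predictions at the test points."""
    alpha = np.linalg.solve(K_train + ridge * np.eye(len(y)), y)
    return K_test_train @ alpha

def average_rf_predict(K_all, n_train, y, ridge, P, n_draws, rng):
    """Monte Carlo average of ridge predictors built on P Gaussian random features."""
    n_all = K_all.shape[0]
    # Small jitter keeps the Cholesky factorization numerically stable.
    chol = np.linalg.cholesky(K_all + 1e-6 * np.eye(n_all))
    total = np.zeros(n_all - n_train)
    for _ in range(n_draws):
        # Each column is one feature f^(j) ~ N(0, K), scaled by 1/sqrt(P),
        # so that E[F F^T] is (approximately) the kernel matrix.
        F_all = chol @ rng.standard_normal((n_all, P)) / np.sqrt(P)
        F, F_test = F_all[:n_train], F_all[n_train:]
        theta = F.T @ np.linalg.solve(F @ F.T + ridge * np.eye(n_train), y)
        total += F_test @ theta
    return total / n_draws

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(30)
X_test = np.linspace(-3, 3, 10)[:, None]
K_all = rbf_kernel(np.vstack([X_train, X_test]), np.vstack([X_train, X_test]))

ridge = 0.1
krr = krr_predict(K_all[:30, :30], K_all[30:, :30], y_train, ridge)
rf_mean = average_rf_predict(K_all, 30, y_train, ridge, P=20, n_draws=200, rng=rng)
print(np.round(krr, 3))
print(np.round(rf_mean, 3))
```

Rerunning `krr_predict` with larger ridge values and comparing against `rf_mean` is one way to look for the effective ridge $\tilde{\lambda}$ empirically; the paper's exact normalization of $\lambda$ may differ from the one assumed here.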

[2]  arXiv:2002.08409 [pdf, other]
Title: On the geometric properties of finite mixture models
Subjects: Statistics Theory (math.ST)

In this paper we relate the geometry of extremal points to properties of mixtures of distributions. For a mixture model in $\mathbb{R}^J$ we consider as a prior the mixing density given by a uniform draw of $n$ points from the unit $(J-1)$-simplex, with $J \leq n$. We relate the extrema of these $n$ points to a mixture model with $m \leq n$ mixture components. We first show that the extrema of the points can recover any mixture density in the convex hull of the $n$ points via the Choquet measure. We then show that as the number of extremal points goes to infinity the convex hull converges to a smooth convex body. We also state a Central Limit Theorem for the number of extremal points. In addition, we state the convergence of the sequence of the empirical measures generated by our model to the Choquet measure. We relate our model to a classical non-parametric one based on a P\'olya tree. We close with an application of our model to population genomics.

[3]  arXiv:2002.08410 [pdf, other]
Title: A Unified Framework for Gaussian Mixture Reduction with Composite Transportation Distance
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Gaussian mixture reduction (GMR) is the problem of approximating a finite Gaussian mixture by one with fewer components. It is widely used in density estimation, nonparametric belief propagation, and Bayesian recursive filtering. Although optimization and clustering-based algorithms have been proposed for GMR, they are either computationally expensive or lack theoretical support. In this work, we propose to perform GMR by minimizing the entropic regularized composite transportation distance between two mixtures. We show our approach provides a unified framework for GMR that is both interpretable and computationally efficient. Our work also bridges the gap between optimization and clustering-based approaches for GMR. A Majorization-Minimization algorithm is developed for our optimization problem and its theoretical convergence is established. Empirical experiments are conducted to show the effectiveness of GMR. The effect of the choice of transportation cost on the performance of GMR is also investigated.

[4]  arXiv:2002.08412 [pdf, other]
Title: Weakly-supervised Multi-output Regression via Correlated Gaussian Processes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Multi-output regression seeks to infer multiple latent functions using data from multiple groups/sources while accounting for potential between-group similarities. In this paper, we consider multi-output regression under a weakly-supervised setting where a subset of data points from multiple groups are unlabeled. We use dependent Gaussian processes for multiple outputs constructed by convolutions with shared latent processes. We introduce hyperpriors for the multinomial probabilities of the unobserved labels and optimize the hyperparameters which we show improves estimation. We derive two variational bounds: (i) a modified variational bound for fast and stable convergence in model inference, (ii) a scalable variational bound that is amenable to stochastic optimization. We use experiments on synthetic and real-world data to show that the proposed model outperforms state-of-the-art models with more accurate estimation of multiple latent functions and unobserved labels.

[5]  arXiv:2002.08422 [pdf, other]
Title: On conditional versus marginal bias in multi-armed bandits
Comments: 20 pages
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

The bias of the sample means of the arms in multi-armed bandits is an important issue in adaptive data analysis that has recently received considerable attention in the literature. Existing results relate in precise ways the sign and magnitude of the bias to various sources of data adaptivity, but do not apply to the conditional inference setting in which the sample means are computed only if some specific conditions are satisfied. In this paper, we characterize the sign of the conditional bias of monotone functions of the rewards, including the sample mean. Our results hold for arbitrary conditioning events and leverage natural monotonicity properties of the data collection policy. We further demonstrate, through several examples from sequential testing and best arm identification, that the sign of the conditional and unconditional bias of the sample mean of an arm can be different, depending on the conditioning event. Our analysis offers new and interesting perspectives on the subtleties of assessing the bias in data adaptive settings.

[6]  arXiv:2002.08436 [pdf, other]
Title: Residual Bootstrap Exploration for Bandit Algorithms
Comments: The first two authors contributed equally
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

In this paper, we propose a novel perturbation-based exploration method for bandit algorithms with bounded or unbounded rewards, called residual bootstrap exploration (ReBoot). ReBoot enforces exploration by injecting data-driven randomness through a residual-based perturbation mechanism. This novel mechanism captures the underlying distributional properties of fitting errors and, more importantly, boosts exploration to escape from suboptimal solutions (for small sample sizes) by inflating the variance level in an unconventional way. In theory, with an appropriate variance inflation level, ReBoot provably secures instance-dependent logarithmic regret in Gaussian multi-armed bandits. We evaluate ReBoot in different synthetic multi-armed bandit problems and observe that ReBoot performs better for unbounded rewards and more robustly than Giro \cite{kveton2018garbage} and PHE \cite{kveton2019perturbed}, with computational efficiency comparable to the Thompson sampling method.

[7]  arXiv:2002.08443 [pdf, other]
Title: Simultaneous Inference for Massive Data: Distributed Bootstrap
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

In this paper, we propose a bootstrap method for massive data distributed across a large number of machines. This new method is computationally efficient in that we bootstrap on the master machine without the over-resampling typically required by existing methods \cite{kleiner2014scalable,sengupta2016subsampled}, while provably achieving optimal statistical efficiency with minimal communication. Our method does not require repeatedly re-fitting the model; it only applies a multiplier bootstrap on the master machine to the gradients received from the worker machines. Simulations validate our theory.

[8]  arXiv:2002.08457 [pdf, other]
Title: ivmodel: An R Package for Inference and Sensitivity Analysis of Instrumental Variables Models with One Endogenous Variable
Comments: 24 pages, 2 figures, 3 tables
Subjects: Applications (stat.AP)

We present a comprehensive R software package, ivmodel, for analyzing instrumental variables with one endogenous variable. The package implements a general class of estimators called k-class estimators and two confidence intervals that are fully robust to weak instruments. The package also provides power formulas for various test statistics in instrumental variables. Finally, the package contains methods for sensitivity analysis to examine the sensitivity of the inference to instrumental variables assumptions. We demonstrate the software on the data set from Card (1995), looking at the causal effect of levels of education on log earnings where the instrument is proximity to a four-year college.
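For readers unfamiliar with k-class estimators, the sketch below illustrates the textbook formula in NumPy. It is not the ivmodel API: the function name `k_class_estimate` is hypothetical, and the setup assumes centered variables, no exogenous covariates, and simulated data. With one endogenous variable $d$, outcome $y$, and instruments $Z$, the estimator is $d^\top (I - k M_Z) y / d^\top (I - k M_Z) d$, where $M_Z$ is the residual-maker matrix of $Z$; $k = 0$ gives ordinary least squares and $k = 1$ gives two-stage least squares.

```python
import numpy as np

def k_class_estimate(y, d, Z, k):
    """k-class estimate of the effect of one endogenous variable d on y.

    Assumes centered data and no exogenous covariates (illustrative only).
    """
    n = len(y)
    P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # projection onto the instruments
    M_Z = np.eye(n) - P_Z                     # residual-maker matrix
    W = np.eye(n) - k * M_Z
    return float(d @ W @ y) / float(d @ W @ d)

# Simulated example (assumed data-generating process, true effect = 2.0).
rng = np.random.default_rng(0)
n = 500
Z = rng.standard_normal((n, 2))               # two instruments
u = rng.standard_normal(n)                    # unobserved confounder
d = Z @ np.array([1.0, 0.5]) + u + rng.standard_normal(n)
y = 2.0 * d + u + rng.standard_normal(n)

beta_ols = k_class_estimate(y, d, Z, k=0.0)   # k = 0: ordinary least squares
beta_tsls = k_class_estimate(y, d, Z, k=1.0)  # k = 1: two-stage least squares
print(beta_ols, beta_tsls)
```

Other members of the class (for example LIML) correspond to data-dependent choices of $k$, which this sketch does not compute.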

[9]  arXiv:2002.08465 [pdf, other]
Title: Descriptive and Predictive Analysis of Euroleague Basketball Games and the Wisdom of Basketball Crowds
Comments: 24 pages, several figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Other Statistics (stat.OT)

In this study we focus on the prediction of basketball games in the Euroleague competition using machine learning modelling. The prediction is a binary classification problem, predicting whether a match finishes 1 (home win) or 2 (away win). Data is collected from the Euroleague's official website for the seasons 2016-2017, 2017-2018 and 2018-2019, i.e. in the new format era. Features are extracted from matches' data and off-the-shelf supervised machine learning techniques are applied. We calibrate and validate our models. We find that simple machine learning models give accuracy not greater than 67% on the test set, worse than some sophisticated benchmark models. Additionally, the importance of this study lies in the "wisdom of the basketball crowd" and we demonstrate how the predicting power of a collective group of basketball enthusiasts can outperform machine learning models discussed in this study. We argue why the accuracy level of this group of "experts" should be set as the benchmark for future studies in the prediction of (European) basketball games using machine learning.

[10]  arXiv:2002.08476 [pdf, other]
Title: A non-inferiority test for R-squared with random regressors
Authors: Harlan Campbell
Comments: 14 pages, 2 figures
Subjects: Methodology (stat.ME)

Determining the lack of association between an outcome variable and a number of different explanatory variables is frequently necessary in order to disregard a proposed model. This paper proposes a non-inferiority test for the coefficient of determination (or squared multiple correlation coefficient), R-squared, in a linear regression analysis with random predictors. The test is derived from inverting a one-sided confidence interval based on a scaled central F distribution.

[11]  arXiv:2002.08505 [pdf, other]
Title: A Bayes Factor Approach with Informative Prior for Rare Genetic Variant Analysis from Next Generation Sequencing Data
Subjects: Applications (stat.AP)

The discovery of rare genetic variants through Next Generation Sequencing is a very challenging issue in the field of human genetics. We propose a novel region-based statistical approach based on a Bayes Factor (BF) to assess evidence of association between a set of rare variants (RVs) located on the same genomic region and a disease outcome in the context of case-control design. Marginal likelihoods are computed under the null and alternative hypotheses assuming a binomial distribution for the RV count in the region and a beta or mixture of Dirac and beta prior distribution for the probability of RV. We derive the theoretical null distribution of the BF under our prior setting and show that a Bayesian control of the False Discovery Rate (BFDR) can be obtained for genome-wide inference. Informative priors are introduced using prior evidence of association from a Kolmogorov-Smirnov test statistic. We use our simulation program, sim1000G, to generate RV data similar to the 1,000 genomes sequencing project. Our simulation studies showed that the new BF statistic outperforms standard methods (SKAT, SKAT-O, Burden test) in case-control studies with moderate sample sizes and is equivalent to them under large sample size scenarios. Our real data application to a lung cancer case-control study found enrichment for RVs in known and novel cancer genes. It also suggests that using the BF with informative prior improves the overall gene discovery compared to the BF with non-informative prior.

[12]  arXiv:2002.08506 [pdf, other]
Title: Causal Inference under Networked Interference
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Estimating individual treatment effects from data of randomized experiments is a critical task in causal inference. The Stable Unit Treatment Value Assumption (SUTVA) is usually made in causal inference. However, interference can introduce bias when the assigned treatment on one unit affects the potential outcomes of the neighboring units. This interference phenomenon is known as spillover effect in economics or peer effect in social science. Usually, in randomized experiments or observational studies with interconnected units, one can only observe treatment responses under interference. Hence, how to estimate the superimposed causal effect and recover the individual treatment effect in the presence of interference becomes a challenging task in causal inference. In this work, we study causal effect estimation under general network interference using GNNs, which are powerful tools for capturing the dependency in the graph. After deriving causal effect estimators, we further study intervention policy improvement on the graph under capacity constraint. We give policy regret bounds under network interference and treatment capacity constraint. Furthermore, a heuristic graph structure-dependent error bound for GNN-based causal estimators is provided.

[13]  arXiv:2002.08514 [pdf, ps, other]
Title: Queueing Subject To Action-Dependent Server Performance: Utilization Rate Reduction
Subjects: Applications (stat.AP); Optimization and Control (math.OC)

We consider a discrete-time system comprising a first-come-first-served queue, a non-preemptive server, and a stationary non-work-conserving scheduler. New tasks arrive at the queue according to a Bernoulli process. At each instant, the server is either busy working on a task or is available, in which case the scheduler either assigns a new task to the server or allows it to remain available (to rest). In addition to the aforementioned availability state, we assume that the server has an integer-valued activity state. The activity state is non-decreasing during work periods, and is non-increasing otherwise. In a typical application of our framework, the server performance (understood as task completion probability) worsens as the activity state increases. In this article, we expand on stabilizability results recently obtained for the same framework to establish methods to design scheduling policies that not only stabilize the queue but also reduce the utilization rate, which is understood as the infinite-horizon time-averaged expected portion of time the server is working. This article has a main theorem leading to two main results: (i) Given an arrival rate, we describe a tractable method, using a finite-dimensional linear program (LP), to compute the infimum of all utilization rates achievable by stabilizing scheduling policies. (ii) We propose a tractable method, also based on finite-dimensional LPs, to obtain stabilizing scheduling policies that are arbitrarily close to the aforementioned infimum. We also establish structural and distributional convergence properties, which are used throughout the article, and are significant in their own right.

[14]  arXiv:2002.08521 [pdf, other]
Title: Network Group Hawkes Process Model
Comments: 42 pages
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)

In this work, we study the event occurrences of user activities on online social network platforms. To characterize the social activity interactions among network users, we propose a network group Hawkes (NGH) process model. In particular, the observed network structure information is employed to model the users' dynamic posting behaviors. Furthermore, the users are clustered into latent groups according to their dynamic behavior patterns. To estimate the model, a constrained maximum likelihood approach is proposed. Theoretically, we establish the consistency and asymptotic normality of the estimators. In addition, we show that the group memberships can be identified consistently. To conduct estimation, a branching representation structure is first introduced, and a stochastic EM (StEM) algorithm is developed to tackle the computational problem. Lastly, we apply the proposed method to a social network dataset collected from Sina Weibo, and identify the influential network users as an interesting application.

[15]  arXiv:2002.08541 [pdf, other]
Title: A Scalable Framework for Sparse Clustering Without Shrinkage
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Clustering, a fundamental activity in unsupervised learning, is notoriously difficult when the feature space is high-dimensional. Fortunately, in many realistic scenarios, only a handful of features are relevant in distinguishing clusters. This has motivated the development of sparse clustering techniques that typically rely on k-means within outer algorithms of high computational complexity. Current techniques also require careful tuning of shrinkage parameters, further limiting their scalability. In this paper, we propose a novel framework for sparse k-means clustering that is intuitive, simple to implement, and competitive with state-of-the-art algorithms. We show that our algorithm enjoys consistency and convergence guarantees. Our core method readily generalizes to several task-specific algorithms such as clustering on subsets of attributes and in partially observed data settings. We showcase these contributions via simulated experiments and benchmark datasets, as well as a case study on mouse protein expression.

[16]  arXiv:2002.08542 [pdf, other]
Title: False Discovery Rate Control via Data Splitting
Comments: 33 pages, 10 figures
Subjects: Methodology (stat.ME)

Selecting relevant features associated with a given response variable is an important issue in many scientific fields. Quantifying quality and uncertainty of the selection via the false discovery rate (FDR) control has been of recent interest. This paper introduces a way of using data-splitting strategies to asymptotically control FDR for various feature selection techniques while maintaining high power. For each feature, the method estimates two independent significance coefficients via data splitting, and constructs a contrast statistic. The FDR control is achieved by taking advantage of the statistic's property that, for any null feature, its sampling distribution is symmetric about 0. We further propose a strategy to aggregate multiple data splits (MDS) to stabilize the selection result and boost the power. Interestingly, this multiple data-splitting approach appears capable of overcoming the power loss caused by data splitting with FDR still under control. The proposed framework is applicable to canonical statistical models including linear models, Gaussian graphical models, and deep neural networks. Simulation results, as well as a real data application, show that the proposed approaches, especially the multiple data-splitting strategy, control FDR well and are often more powerful than existing methods including the Benjamini-Hochberg procedure and the knockoff filter.
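The selection rule sketched in this abstract can be illustrated as follows. This is a hedged sketch rather than the authors' implementation: the contrast statistic $M_j = \mathrm{sign}(b^{(1)}_j b^{(2)}_j)(|b^{(1)}_j| + |b^{(2)}_j|)$ and the threshold rule are assumptions (one common choice for statistics that are symmetric about 0 under the null), the function names are hypothetical, and the coefficient values are hand-written rather than estimated from real data splits.

```python
import numpy as np

def contrast_statistics(b1, b2):
    """Combine two independent coefficient estimates into one contrast statistic.

    Large positive values suggest a relevant feature; for null features the
    statistic is (assumed to be) symmetric about 0.
    """
    return np.sign(b1 * b2) * (np.abs(b1) + np.abs(b2))

def select_features(M, q):
    """Select features with M_j > t for the smallest t whose estimated FDP is <= q.

    The number of statistics below -t estimates the number of false positives
    above t, which is valid only under the symmetry assumption.
    """
    for t in np.sort(np.abs(M[M != 0])):
        fdp_hat = np.sum(M < -t) / max(np.sum(M > t), 1)
        if fdp_hat <= q:
            return np.flatnonzero(M > t)
    return np.array([], dtype=int)

# Hand-written coefficients standing in for estimates from two data halves.
b1 = np.array([3.0, 2.5, 0.1, -0.2, 0.05, 2.0])
b2 = np.array([2.8, 2.9, -0.1, 0.1, 0.02, 1.5])

M = contrast_statistics(b1, b2)
selected = select_features(M, q=0.2)
print(selected)  # indices of the selected features
```

In this toy input, features 0, 1, and 5 have large estimates of matching sign in both halves, so they receive large positive statistics, while the remaining features have small or sign-discordant estimates.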

[17]  arXiv:2002.08543 [pdf, ps, other]
Title: Derivation of the Exact Moments of the Distribution of Pearsons Correlation over Permutations of Data
Comments: 8 Pages
Subjects: Statistics Theory (math.ST)

Pearson's correlation is one of the most widely used measures of association today, the importance of which to modern science cannot be overstated. Two of the most common methods for computing the p-value for a hypothesis test of this correlation are a t-statistic and permutation sampling. When a dataset comes from a bivariate normal distribution under specific data transformations, a t-statistic is exact. However, for datasets which do not follow this stipulation, both approaches are merely approximations of the distribution of the correlation over permutations of the data. In this paper we explicitly show the dependency of the permutation distribution of Pearson's correlation on the central moments of the data and derive an inductive formula which allows the computation of these exact moments. This has direct implications for computing the p-value for general datasets which could lead to more computationally accurate methods.
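To make the object of study concrete, the sketch below enumerates the full permutation distribution of Pearson's correlation for a tiny, made-up dataset. It does not implement the paper's inductive formula; it only checks two classical facts by brute force: over all permutations, the first moment of $r$ is 0 and the second moment is $1/(n-1)$. The data values and function names are illustrative assumptions.

```python
import itertools
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient of two 1-D arrays."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def exact_permutation_distribution(x, y):
    """Correlation of x with every permutation of y (feasible only for small n)."""
    return np.array([pearson_r(x, np.array(p)) for p in itertools.permutations(y)])

# Made-up data with n = 6, giving 6! = 720 permutations.
x = np.array([0.3, 1.2, 2.5, 3.1, 4.8, 5.0])
y = np.array([1.0, 0.4, 2.2, 2.9, 3.5, 6.1])

r_all = exact_permutation_distribution(x, y)
r_obs = pearson_r(x, y)

# Exact two-sided permutation p-value (small tolerance guards against rounding).
p_exact = float(np.mean(np.abs(r_all) >= abs(r_obs) - 1e-12))

first_moment = float(r_all.mean())          # equals 0 up to rounding
second_moment = float((r_all ** 2).mean())  # equals 1 / (n - 1) up to rounding
print(p_exact, first_moment, second_moment)
```

Full enumeration grows as $n!$, which is why permutation sampling or moment-based approximations are used for realistic sample sizes.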

[18]  arXiv:2002.08545 [pdf, other]
Title: Familywise Error Rate Control by Interactive Unmasking
Comments: 22 pages, 8 figures
Subjects: Methodology (stat.ME)

We propose a method for multiple hypothesis testing with familywise error rate (FWER) control, called the i-FWER test. Most testing methods are predefined algorithms that do not allow modifications after observing the data. However, in practice, analysts tend to choose a promising algorithm after observing the data; unfortunately, this violates the validity of the conclusion. The i-FWER test allows much flexibility: a human (or a computer program acting on the human's behalf) may adaptively guide the algorithm in a data-dependent manner. We prove that our test controls FWER if the analysts adhere to a particular protocol of "masking" and "unmasking". We demonstrate via numerical experiments the power of our test under structured non-nulls, and then explore new forms of masking.

[19]  arXiv:2002.08560 [pdf, other]
Title: Robust M-estimation for Partially Observed Functional Data
Comments: 38 pages, 5 figures
Subjects: Methodology (stat.ME)

Irregular functional data in which densely sampled curves are observed over different ranges pose a challenge for modeling and inference, and sensitivity to outlier curves is a concern in many applications. This paper investigates a class of robust M-estimators for partially observed functional data, modeling irregular structure using a missing data framework. We derive asymptotic normality of the functional M-estimator under the proposed framework and show root-$n$ rates of convergence. Furthermore, we propose a class of functional trend tests to find significant directions in the trend of location. For the implementation of the inferential test, we adopt a joint bootstrap approach. The performance is demonstrated in simulations and an application to data from quantitative ultrasound analysis.

[20]  arXiv:2002.08563 [pdf, other]
Title: The continuous categorical: a novel simplex-valued exponential family
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Simplex-valued data appear throughout statistics and machine learning, for example in the context of transfer learning and compression of deep networks. Existing models for this class of data rely on the Dirichlet distribution or other related loss functions; here we show these standard choices suffer systematically from a number of limitations, including bias and numerical issues that frustrate the use of flexible network models upstream of these distributions. We resolve these limitations by introducing a novel exponential family of distributions for modeling simplex-valued data - the continuous categorical, which arises as a nontrivial multivariate generalization of the recently discovered continuous Bernoulli. Unlike the Dirichlet and other typical choices, the continuous categorical results in a well-behaved probabilistic loss function that produces unbiased estimators, while preserving the mathematical simplicity of the Dirichlet. As well as exploring its theoretical properties, we introduce sampling methods for this distribution that are amenable to the reparameterization trick, and evaluate their performance. Lastly, we demonstrate that the continuous categorical outperforms standard choices empirically, across a simulation study, an applied example on multi-party elections, and a neural network compression task.

[21]  arXiv:2002.08609 [pdf, other]
Title: A Bayesian Feature Allocation Model for Identification of Cell Subpopulations Using Cytometry Data
Subjects: Applications (stat.AP)

A Bayesian feature allocation model (FAM) is presented for identifying cell subpopulations based on multiple samples of cell surface or intracellular marker expression level data obtained by cytometry by time of flight (CyTOF). Cell subpopulations are characterized by differences in expression patterns of markers, and individual cells are clustered into the subpopulations based on the patterns of their observed expression levels. A finite Indian buffet process is used to model subpopulations as latent features, and a model-based method based on these latent feature subpopulations is used to construct cell clusters within each sample. Non-ignorable missing data due to technical artifacts in mass cytometry instruments are accounted for by defining a static missing data mechanism. In contrast to conventional cell clustering methods based on observed marker expression levels that are applied separately to different samples, the FAM-based method can be applied simultaneously to multiple samples, and can identify important cell subpopulations likely to be missed by conventional clustering. The proposed FAM-based method is applied to jointly analyze three datasets, generated by CyTOF, to study natural killer (NK) cells. Because the subpopulations identified by the FAM may define novel NK cell subsets, this statistical analysis may provide useful information about the biology of NK cells and their potential role in cancer immunotherapy which may lead, in turn, to development of improved cellular therapies. Simulation studies of the proposed method's behavior under two cases of known subpopulations are also presented, followed by analysis of the CyTOF NK cell surface marker data.

[22]  arXiv:2002.08663 [pdf, ps, other]
Title: Learning Gaussian Graphical Models via Multiplicative Weights
Comments: AISTATS 2020
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)

Graphical model selection in Markov random fields is a fundamental problem in statistics and machine learning. Two particularly prominent models, the Ising model and Gaussian model, have largely developed in parallel using different (though often related) techniques, and several practical algorithms with rigorous sample complexity bounds have been established for each. In this paper, we adapt a recently proposed algorithm of Klivans and Meka (FOCS, 2017), based on the method of multiplicative weight updates, from the Ising model to the Gaussian model, via non-trivial modifications to both the algorithm and its analysis. The algorithm enjoys a sample complexity bound that is qualitatively similar to others in the literature, has a low runtime $O(mp^2)$ in the case of $m$ samples and $p$ nodes, and can trivially be implemented in an online manner.

[23]  arXiv:2002.08724 [pdf, other]
Title: Generalized sampling with functional principal components for high-resolution random field estimation
Authors: Milana Gataric
Subjects: Statistics Theory (math.ST); Signal Processing (eess.SP); Numerical Analysis (math.NA); Machine Learning (stat.ML)

In this paper, we take a statistical approach to the problem of recovering a function from low-resolution measurements taken with respect to an arbitrary basis, by regarding the function of interest as a realization of a random field. We introduce an infinite-dimensional framework for high-resolution estimation of a random field from its low-resolution indirect measurements as well as the high-resolution measurements of training observations by merging the existing frameworks of generalized sampling and functional principal component analysis. We study the statistical performance of the resulting estimation procedure and show that high-resolution recovery is indeed possible provided appropriate low-rank and angle conditions hold and provided the training set is sufficiently large relative to the desired resolution. We also consider sparse representations of the principal components, which can reduce the required size of the training set. Furthermore, the effectiveness of the proposed procedure is investigated in various numerical examples.

[24]  arXiv:2002.08731 [pdf, other]
Title: APTER: Aggregated Prognosis Through Exponential Reweighting
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)

This paper considers the task of learning how to make a prognosis of a patient based on his/her micro-array expression levels. The method is an application of the aggregation method recently proposed in the literature on theoretical machine learning, and excels in its computational convenience and capability to deal with high-dimensional data. A formal analysis of the method is given, yielding rates of convergence similar to what traditional techniques obtain, while it is shown to cope well with an exponentially large set of features. These results are supported by numerical simulations on a range of publicly available survival-micro-array datasets. It is empirically found that the proposed technique combined with a recently proposed preprocessing technique gives excellent performance.

[25]  arXiv:2002.08757 [pdf, other]
Title: Asymptotically Optimal Bias Reduction for Parametric Models
Comments: arXiv admin note: substantial text overlap with arXiv:1907.11541
Subjects: Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)

An important challenge in statistical analysis concerns the control of the finite sample bias of estimators. This problem is magnified in high-dimensional settings where the number of variables $p$ diverges with the sample size $n$, as well as for nonlinear models and/or models with discrete data. For these complex settings, we propose to use a general simulation-based approach and show that the resulting estimator has a bias of order $\mathcal{O}(0)$, hence providing an asymptotically optimal bias reduction. It is based on an initial estimator that can be slightly asymptotically biased, making the approach very generally applicable. This is particularly relevant when classical estimators, such as the maximum likelihood estimator, can only be (numerically) approximated. We show that the iterative bootstrap of Kuk (1995) provides a computationally efficient approach to compute this bias reduced estimator. We illustrate our theoretical results in simulation studies for which we develop new bias reduced estimators for the logistic regression, with and without random effects. These estimators enjoy additional properties such as robustness to data contamination and to the problem of separability.
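The iterative bootstrap mentioned in this abstract can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the authors' code: it uses the update $\theta^{(k+1)} = \theta^{(k)} + \hat{\pi}(\text{data}) - \frac{1}{H}\sum_{h} \hat{\pi}(\text{simulated data}_h(\theta^{(k)}))$, takes the variance MLE of a normal sample (which divides by $n$ and is biased downward) as the initial estimator $\hat{\pi}$, and reuses the same random seed in every iteration. All names and settings are hypothetical.

```python
import numpy as np

def initial_estimator(sample):
    """Biased initial estimator: the normal variance MLE (divides by n)."""
    return float(np.var(sample))  # ddof=0 by default

def iterative_bootstrap(x, H=1000, n_iter=20, seed=0):
    """Bias-reduce the initial estimator by matching it on simulated samples."""
    n = len(x)
    pi_hat = initial_estimator(x)
    theta = pi_hat  # start from the biased estimate
    for _ in range(n_iter):
        # Reusing the seed keeps the simulated noise fixed across iterations,
        # so the update is a deterministic fixed-point iteration.
        rng = np.random.default_rng(seed)
        sims = np.sqrt(theta) * rng.standard_normal((H, n))
        pi_sim = float(np.var(sims, axis=1).mean())
        theta = theta + pi_hat - pi_sim
    return theta

# Toy data: 20 draws from a normal distribution with variance 4 (assumed).
x = np.random.default_rng(1).normal(loc=0.0, scale=2.0, size=20)

pi_hat = initial_estimator(x)
theta_ib = iterative_bootstrap(x)
print(pi_hat, theta_ib, np.var(x, ddof=1))
```

In this toy case the fixed point rescales the MLE by roughly $n/(n-1)$, so the result lands near the usual unbiased sample variance; the models considered in the paper (for example logistic regression with random effects) have no such closed form, which is where the simulation-based update is useful.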

[26]  arXiv:2002.08774 [pdf, ps, other]
Title: Propose, Test, Release: Differentially private estimation with high probability
Comments: arXiv admin note: text overlap with arXiv:1906.11923
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Statistics Theory (math.ST)

We derive concentration inequalities for differentially private median and mean estimators building on the "Propose, Test, Release" (PTR) mechanism introduced by Dwork and Lei (2009). We introduce a new general version of the PTR mechanism that allows us to derive high probability error bounds for differentially private estimators. Our algorithms provide the first statistical guarantees for differentially private estimation of the median and mean without any boundedness assumptions on the data, and without assuming that the target population parameter lies in some known bounded interval. Our procedures do not rely on any truncation of the data and provide the first sub-Gaussian high probability bounds for differentially private median and mean estimation, for possibly heavy tailed random variables.

[27]  arXiv:2002.08789 [pdf, other]
Title: Consistent model selection procedure for general integer-valued time series
Subjects: Statistics Theory (math.ST)

This paper deals with the problem of model selection for a general class of integer-valued time series.
We propose a penalized criterion based on the Poisson quasi-likelihood of the model.
Under certain regularity conditions, the consistency of the procedure as well as the consistency and the asymptotic normality of the Poisson quasi-likelihood estimator of the selected model are established.
Simulation experiments are conducted for some classical models such as Poisson, binary INGARCH and negative binomial models with nonlinear dynamics. An application to a real dataset is also provided.

[28]  arXiv:2002.08797 [pdf, other]
Title: Pruning untrained neural networks: Principles and Analysis
Comments: 50 pages, 12 figures
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Overparameterized neural networks display state-of-the-art performance. However, there is a growing need for smaller, energy-efficient neural networks to be able to use machine learning applications on devices with limited computational resources. A popular approach consists of using pruning techniques. While these techniques have traditionally focused on pruning pre-trained neural networks (e.g. LeCun et al. (1990) and Hassibi et al. (1993)), recent work by Lee et al. (2018) showed promising results where pruning is performed at initialization. However, such procedures remain unsatisfactory as the resulting pruned networks can be difficult to train and, for instance, these procedures do not prevent one layer from being fully pruned. In this paper we provide a comprehensive theoretical analysis of pruning at initialization and training sparse architectures. This analysis allows us to propose novel principled approaches which we validate experimentally on a variety of network architectures. In particular, we show that we can prune up to 99.9% of the weights while keeping the model trainable.

[29]  arXiv:2002.08853 [pdf, other]
Title: A General Pairwise Comparison Model for Extremely Sparse Networks
Comments: 27 pages, 4 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Statistical inference using pairwise comparison data has been an effective approach to analyzing complex and sparse networks. In this paper we propose a general framework for modeling the mutual interaction in a probabilistic network, which enjoys ample flexibility in terms of parametrization. Within this set-up, we establish that the maximum likelihood estimator (MLE) for the latent scores of the subjects is uniformly consistent under a near-minimal condition on network sparsity. This condition is sharp in terms of the leading order asymptotics describing the sparsity. The proof utilizes a novel chaining technique based on the error-induced metric as well as careful counting of comparison graph structures. Our results guarantee that the MLE is a valid estimator for inference in large-scale comparison networks where data is asymptotically deficient. Numerical simulations are provided to complement the theoretical analysis.

[30]  arXiv:2002.08871 [pdf, other]
Title: Fast Differentiable Sorting and Ranking
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

The sorting operation is one of the most basic and commonly used building blocks in computer programming. In machine learning, it is commonly used for robust statistics. However, seen as a function, it is piecewise linear and as a result includes many kinks at which it is non-differentiable. More problematic is the related ranking operator, commonly used for order statistics and ranking metrics. It is a piecewise constant function, meaning that its derivatives are null or undefined. While numerous works have proposed differentiable proxies to sorting and ranking, they do not achieve the $O(n \log n)$ time complexity one would expect from sorting and ranking operations. In this paper, we propose the first differentiable sorting and ranking operators with $O(n \log n)$ time and $O(n)$ space complexity. Our proposal in addition enjoys exact computation and differentiation. We achieve this feat by constructing differentiable sorting and ranking operators as projections onto the permutahedron, the convex hull of permutations, and using a reduction to isotonic optimization. Empirically, we confirm that our approach is an order of magnitude faster than existing approaches and showcase two novel applications: differentiable Spearman's rank correlation coefficient and soft least trimmed squares.

[31]  arXiv:2002.08943 [pdf, other]
Title: Implicit differentiation of Lasso-type models for hyperparameter optimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Setting regularization parameters for Lasso-type estimators is notoriously difficult, though crucial in practice. The most popular hyperparameter optimization approach is grid-search using held-out validation data. Grid-search, however, requires choosing a predefined grid for each parameter, and its cost scales exponentially in the number of parameters. Another approach is to cast hyperparameter optimization as a bi-level optimization problem, which one can solve by gradient descent. The key challenge for these methods is the estimation of the gradient with respect to the hyperparameters. Computing this gradient via forward or backward automatic differentiation is possible yet usually suffers from high memory consumption. Alternatively, implicit differentiation typically involves solving a linear system, which can be prohibitive and numerically unstable in high dimension. In addition, implicit differentiation usually assumes smooth loss functions, which is not the case for Lasso-type problems. This work introduces an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems. Our approach scales to high-dimensional data by leveraging the sparsity of the solutions. Experiments demonstrate that the proposed method outperforms a large number of standard methods to optimize the error on held-out data, or the Stein Unbiased Risk Estimator (SURE).

[32]  arXiv:2002.08948 [pdf, other]
Title: I-SPEC: An End-to-End Framework for Learning Transportable, Shift-Stable Models
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Shifts in environment between development and deployment cause classical supervised learning to produce models that fail to generalize well to new target distributions. Recently, many solutions which find invariant predictive distributions have been developed. Among these, graph-based approaches do not require data from the target environment and can capture more stable information than alternative methods which find stable feature sets. However, these approaches assume that the data generating process is known in the form of a full causal graph, which is generally not the case. In this paper, we propose I-SPEC, an end-to-end framework that addresses this shortcoming by using data to learn a partial ancestral graph (PAG). Using the PAG we develop an algorithm that determines an interventional distribution that is stable to the declared shifts; this subsumes existing approaches which find stable feature sets that are less accurate. We apply I-SPEC to a mortality prediction problem to show it can learn a model that is robust to shifts without needing upfront knowledge of the full causal DAG.

Cross-lists for Fri, 21 Feb 20

[33]  arXiv:2002.05160 (cross-list from cs.DS) [pdf, other]
Title: Optimal Multiple Stopping Rule for Warm-Starting Sequential Selection
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

In this paper we present the Warm-starting Dynamic Thresholding algorithm, developed using dynamic programming, for a variant of the standard online selection problem. The problem allows job positions to be either free or already occupied at the beginning of the process. Throughout the selection process, the decision maker interviews the new candidates one after the other and reveals a quality score for each of them. Based on that information, she can (re)assign each job at most once by taking immediate and irrevocable decisions. We relax the hard requirement of the class of dynamic programming algorithms to perfectly know the distribution from which the scores of candidates are drawn, by presenting extensions for the partial and no-information cases, in which the decision maker can learn the underlying score distribution sequentially while interviewing candidates.

[34]  arXiv:2002.08356 (cross-list from physics.med-ph) [pdf, other]
Title: Comparative Visual Analytics for Assessing Medical Records with Sequence Embedding
Comments: This manuscript is currently under review
Subjects: Medical Physics (physics.med-ph); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Machine learning for data-driven diagnosis has been actively studied in medicine to provide better healthcare. Supporting analysis of a patient cohort similar to a patient under treatment is a key task for clinicians to make decisions with high confidence. However, such analysis is not straightforward due to the characteristics of medical records: high dimensionality, irregularity in time, and sparsity. To address this challenge, we introduce a method for similarity calculation of medical records. Our method employs event and sequence embeddings. While we use an autoencoder for the event embedding, we apply its variant with the self-attention mechanism for the sequence embedding. Moreover, in order to better handle the irregularity of data, we enhance the self-attention mechanism with consideration of different time intervals. We have developed a visual analytics system to support comparative studies of patient records. To make a comparison of sequences with different lengths easier, our system incorporates a sequence alignment method. Through its interactive interface, the user can quickly identify patients of interest and conveniently review both the temporal and multivariate aspects of the patient records. We demonstrate the effectiveness of our design and system with case studies using a real-world dataset from the neonatal intensive care unit of UC Davis.

[35]  arXiv:2002.08396 (cross-list from cs.LG) [pdf, other]
Title: Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning
Comments: To appear in ICLR 2020
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)

Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed data-set (batch) of environment interactions is available and no new experience can be acquired. This property makes these algorithms appealing for real world problems such as robot control. In practice, however, standard off-policy algorithms fail in the batch setting for continuous control. In this paper, we propose a simple solution to this problem. It admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task. Our method can be seen as an extension of recent work on batch-RL that enables stable learning from conflicting data-sources. We find improvements on competitive baselines in a variety of RL tasks -- including standard continuous control benchmarks and multi-task learning for simulated and real-world robots.

[36]  arXiv:2002.08405 (cross-list from cs.LG) [pdf, other]
Title: Warm Starting Bandits with Side Information from Confounded Data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We study a variant of the multi-armed bandit problem where side information in the form of bounds on the mean of each arm is provided. We describe how these bounds on the means can be used efficiently for warm starting bandits. Specifically, we propose the novel UCB-SI algorithm, and illustrate improvements in cumulative regret over the standard UCB algorithm, both theoretically and empirically, in the presence of non-trivial side information. As noted in (Zhang & Bareinboim, 2017), such information arises, for instance, when we have prior logged data on the arms, but this data has been collected under a policy whose choice of arms is based on latent variables to which access is no longer available. We further provide a novel approach for obtaining such bounds from prior partially confounded data under some mild assumptions. We validate our findings through semi-synthetic experiments on data derived from real datasets.

[37]  arXiv:2002.08423 (cross-list from cs.LG) [pdf, other]
Title: PrivacyFL: A simulator for privacy-preserving and secure federated learning
Comments: 15 pages
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)

Federated learning is a technique that enables distributed clients to collaboratively learn a shared machine learning model while keeping their training data localized. This reduces data privacy risks; however, privacy concerns still exist since it is possible to leak information about the training dataset from the trained model's weights or parameters. Setting up a federated learning environment, especially with security and privacy guarantees, is a time-consuming process with numerous configurations and parameters that can be manipulated. In order to help clients ensure that collaboration is feasible and to check that it improves their model accuracy, a real-world simulator for privacy-preserving and secure federated learning is required.
In this paper, we introduce PrivacyFL, which is an extensible, easily configurable and scalable simulator for federated learning environments. Its key features include latency simulation, robustness to client departure, support for both centralized and decentralized learning, and configurable privacy and security mechanisms based on differential privacy and secure multiparty computation.
In this paper, we motivate our research, describe the architecture of the simulator and associated protocols, and discuss its evaluation in numerous scenarios that highlight its wide range of functionality and its advantages. Our paper addresses a significant real-world problem: checking the feasibility of participating in a federated learning environment under a variety of circumstances. It also has a strong practical impact because organizations such as hospitals, banks, and research institutes, which have large amounts of sensitive data and would like to collaborate, would greatly benefit from having a system that enables them to do so in a privacy-preserving and secure manner.

[38]  arXiv:2002.08456 (cross-list from cs.GT) [pdf, other]
Title: From Poincaré Recurrence to Convergence in Imperfect Information Games: Finding Equilibrium via Regularization
Comments: 43 pages
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)

In this paper we investigate the Follow the Regularized Leader dynamics in sequential imperfect information games (IIG). We generalize existing results of Poincaré recurrence from normal-form games to zero-sum two-player imperfect information games and other sequential game settings. We then investigate how adapting the reward (by adding a regularization term) of the game can give strong convergence guarantees in monotone games. We continue by showing how this reward adaptation technique can be leveraged to build algorithms that converge exactly to the Nash equilibrium. Finally, we show how these insights can be directly used to build state-of-the-art model-free algorithms for zero-sum two-player Imperfect Information Games (IIG).

[39]  arXiv:2002.08483 (cross-list from cs.LG) [pdf, other]
Title: Strength from Weakness: Fast Learning Using Weak Supervision
Comments: 21 pages, 8 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We study generalization properties of weakly supervised learning. That is, learning where only a few "strong" labels (the actual target of our prediction) are present but many more "weak" labels are available. In particular, we show that having access to weak labels can significantly accelerate the learning rate for the strong task to the fast rate of $\mathcal{O}(1/n)$, where $n$ denotes the number of strongly labeled data points. This acceleration can happen even if by itself the strongly labeled data admits only the slower $\mathcal{O}(1/\sqrt{n})$ rate. The actual acceleration depends continuously on the number of weak labels available, and on the relation between the two tasks. Our theoretical results are reflected empirically across a range of tasks and illustrate how weak labels speed up learning on the strong task.

[40]  arXiv:2002.08484 (cross-list from cs.LG) [pdf, other]
Title: Estimating Training Data Influence by Tracking Gradient Descent
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We introduce a method called TrackIn that computes the influence of a training example on a prediction made by the model, by tracking how the loss on the test point changes during the training process whenever the training example of interest was utilized. We provide a scalable implementation of TrackIn via a combination of a few key ideas: (a) a first-order approximation to the exact computation, (b) using random projections to speed up the computation of the first-order approximation for large models, (c) using saved checkpoints of standard training procedures, and (d) cherry-picking layers of a deep neural network. An experimental evaluation shows that TrackIn is more effective in identifying mislabelled training examples than other related methods such as influence functions and representer points. We also discuss insights from applying the method on vision, regression and natural language tasks.
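
The checkpoint-based, first-order idea in items (a) and (c) can be illustrated with a minimal sketch. It assumes the influence of a training point on a test point is approximated by summing, over saved checkpoints, the learning rate times the inner product of the two loss gradients. The linear model, squared loss, SGD loop, and all function names below are hypothetical choices made for illustration; they are not taken from the paper's implementation, and the random-projection and layer-selection steps are omitted.

```python
import numpy as np

def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w @ x - y)**2 with respect to w."""
    return (w @ x - y) * x

def checkpoint_influence(checkpoints, lrs, train_pt, test_pt):
    """Sum over checkpoints of lr * <grad loss(train), grad loss(test)>."""
    x_tr, y_tr = train_pt
    x_te, y_te = test_pt
    total = 0.0
    for w, lr in zip(checkpoints, lrs):
        g_train = grad_squared_loss(w, x_tr, y_tr)
        g_test = grad_squared_loss(w, x_te, y_te)
        total += lr * (g_train @ g_test)
    return float(total)

def train_with_checkpoints(X, y, lr=0.1, epochs=5):
    """Plain SGD on squared loss, saving the weights at the start of each epoch."""
    w = np.zeros(X.shape[1])
    checkpoints = []
    for _ in range(epochs):
        checkpoints.append(w.copy())
        for x_i, y_i in zip(X, y):
            w = w - lr * grad_squared_loss(w, x_i, y_i)
    return w, checkpoints

# Toy data (hypothetical): noiseless linear targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w_final, ckpts = train_with_checkpoints(X, y)
lrs = [0.1] * len(ckpts)

# Influence of training example 0 on itself and on example 1.
self_influence = checkpoint_influence(ckpts, lrs, (X[0], y[0]), (X[0], y[0]))
cross_influence = checkpoint_influence(ckpts, lrs, (X[0], y[0]), (X[1], y[1]))
```

Under this approximation the self-influence of an example is a sum of squared gradient norms and is therefore non-negative; unusually large values are one plausible signal for flagging suspicious training examples.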

[41]  arXiv:2002.08491 (cross-list from math.NA) [pdf, other]
Title: Entrywise convergence of iterative methods for eigenproblems
Comments: 22 pages, 6 figures
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)

Several problems in machine learning, statistics, and other fields rely on computing eigenvectors. For large scale problems, the computation of these eigenvectors is typically performed via iterative schemes such as subspace iteration or Krylov methods. While there is classical and comprehensive analysis for subspace convergence guarantees with respect to the spectral norm, in many modern applications other notions of subspace distance are more appropriate. Recent theoretical work has focused on perturbations of subspaces measured in the $\ell_{2 \to \infty}$ norm, but does not consider the actual computation of eigenvectors. Here we address the convergence of subspace iteration when distances are measured in the $\ell_{2 \to \infty}$ norm and provide deterministic bounds. We complement our analysis with a practical stopping criterion and demonstrate its applicability via numerical experiments. Our results show that one can get comparable performance on downstream tasks while requiring fewer iterations, thereby saving substantial computational time.
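
As a rough illustration of the setting, the sketch below runs textbook subspace iteration on a symmetric matrix and monitors the change between successive iterates in the $\ell_{2 \to \infty}$ norm, i.e. the largest Euclidean norm of any row. The stopping rule used here (aligning successive bases by an orthogonal Procrustes rotation and thresholding the row-wise change) is an assumption made for this example and should not be read as the stopping criterion proposed in the paper; the test matrix and tolerance are likewise hypothetical.

```python
import numpy as np

def two_to_infinity_norm(M):
    """l_{2->inf} norm: the largest Euclidean norm among the rows of M."""
    return float(np.max(np.linalg.norm(M, axis=1)))

def subspace_iteration(A, k, tol=1e-8, max_iter=1000, seed=0):
    """Subspace iteration for the top-k invariant subspace of a symmetric A."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(A.shape[0], k)))
    for it in range(1, max_iter + 1):
        Q_new, _ = np.linalg.qr(A @ Q)
        # Align the new basis with the old one (orthogonal Procrustes),
        # so that the comparison ignores rotations within the subspace.
        U, _, Vt = np.linalg.svd(Q_new.T @ Q)
        change = two_to_infinity_norm(Q_new @ (U @ Vt) - Q)
        Q = Q_new
        if change < tol:
            break
    return Q, it

# Hypothetical test matrix with a known spectrum and eigenvectors V.
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.normal(size=(6, 6)))
eigvals = np.array([10.0, 5.0, 1.0, 0.5, 0.2, 0.1])
A = V @ np.diag(eigvals) @ V.T

Q, n_iter = subspace_iteration(A, k=2)
```

Measuring the change row by row is stricter than a spectral-norm check when individual entries of the eigenvectors matter, for example when rows correspond to nodes of a network.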

[42]  arXiv:2002.08517 (cross-list from cs.LG) [pdf, other]
Title: Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks
Comments: 18 pages, 9 figures, 2 tables
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Analysing and computing with Gaussian processes arising from infinitely wide neural networks has recently seen a resurgence in popularity. Despite this, many explicit covariance functions of networks with activation functions used in modern networks remain unknown. Furthermore, while the kernels of deep networks can be computed iteratively, theoretical understanding of deep kernels is lacking, particularly with respect to fixed-point dynamics. Firstly, we derive the covariance functions of MLPs with exponential linear units and Gaussian error linear units and evaluate the performance of the limiting Gaussian processes on some benchmarks. Secondly, and more generally, we introduce a framework for analysing the fixed-point dynamics of iterated kernels corresponding to a broad range of activation functions. We find that unlike some previously studied neural network kernels, these new kernels exhibit non-trivial fixed-point dynamics which are mirrored in finite-width neural networks.

[43]  arXiv:2002.08526 (cross-list from cs.LG) [pdf, other]
Title: Scalable Constrained Bayesian Optimization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

The global optimization of a high-dimensional black-box function under black-box constraints is a pervasive task in machine learning, control, and engineering. These problems are difficult since the feasible set is typically non-convex and hard to find, in addition to the curses of dimensionality and the heterogeneity of the underlying functions. In particular, these characteristics dramatically impact the performance of Bayesian optimization methods, which have otherwise become the de facto standard for sample-efficient optimization in unconstrained settings. Due to the lack of sample-efficient methods, practitioners usually fall back to evolutionary strategies or heuristics. We propose the scalable constrained Bayesian optimization (SCBO) algorithm that addresses the above challenges by data-independent transformations of the functions and follows the recent theme of local Bayesian optimization. A comprehensive experimental evaluation demonstrates that SCBO achieves excellent results and outperforms the state-of-the-art methods.

[44]  arXiv:2002.08528 (cross-list from cs.LG) [pdf, other]
Title: Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

We study distributed optimization algorithms for minimizing the average of heterogeneous functions distributed across several machines with a focus on communication efficiency. In such settings, naively using the classical stochastic gradient descent (SGD) or its variants (e.g., SVRG) with a uniform sampling of machines typically yields poor performance. It often makes the convergence rate depend on the maximum Lipschitz constant of the gradients across the devices. In this paper, we propose a novel adaptive sampling of machines specially catered to these settings. Our method relies on an adaptive estimate of local Lipschitz constants based on the information of past gradients. We show that the new sampling scheme improves the dependence of the convergence rate from the maximum Lipschitz constant to the average Lipschitz constant across machines, thereby significantly accelerating the convergence. Our experiments demonstrate that our method indeed speeds up the convergence of the standard SVRG algorithm in heterogeneous environments.

[45]  arXiv:2002.08536 (cross-list from cs.LG) [pdf, other]
Title: Safe Counterfactual Reinforcement Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)

We develop a method for predicting the performance of reinforcement learning and bandit algorithms, given historical data that may have been generated by a different algorithm. Our estimator has the property that its prediction converges in probability to the true performance of a counterfactual algorithm at the fast $\sqrt{N}$ rate, as the sample size $N$ increases. We also show a correct way to estimate the variance of our prediction, thus allowing the analyst to quantify the uncertainty in the prediction. These properties hold even when the analyst does not know which among a large number of potentially important state variables are really important. These theoretical guarantees make our estimator safe to use. We finally apply it to improve advertisement design by a major advertisement company. We find that our method produces smaller mean squared errors than state-of-the-art methods.

[46]  arXiv:2002.08537 (cross-list from math.OC) [pdf, other]
Title: Adaptive Temporal Difference Learning with Linear Function Approximation
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

This paper revisits the celebrated temporal difference (TD) learning algorithm for policy evaluation in reinforcement learning. Typically, the performance of the plain-vanilla TD algorithm is sensitive to the choice of stepsizes. Oftentimes, TD suffers from slow convergence. Motivated by the tight connection between the TD learning algorithm and the stochastic gradient methods, we develop the first adaptive variant of the TD learning algorithm with linear function approximation that we term AdaTD. In contrast to the original TD, AdaTD is robust or less sensitive to the choice of stepsizes. Analytically, we establish that to reach an $\epsilon$ accuracy, the number of iterations needed is $\tilde{O}(\epsilon^{-2}\ln^4\frac{1}{\epsilon}/\ln^4\frac{1}{\rho})$, where $\rho$ represents the rate at which the underlying Markov chain converges to the stationary distribution. This implies that the iteration complexity of AdaTD is no worse than that of TD in the worst case. Going beyond TD, we further develop an adaptive variant of TD($\lambda$), which is referred to as AdaTD($\lambda$). We evaluate the empirical performance of AdaTD and AdaTD($\lambda$) on several standard reinforcement learning tasks in OpenAI Gym on both linear and nonlinear function approximation, which demonstrate the effectiveness of our new approaches over existing ones.
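
For context, the baseline that the abstract starts from, plain TD(0) with linear function approximation and a fixed stepsize, can be sketched as follows. This is not AdaTD: the adaptive stepsize rule is not described in the abstract and is not reproduced here. The two-state Markov reward process, the tabular features, and the stepsize `alpha` are all hypothetical values chosen for illustration.

```python
import numpy as np

def td0_linear(P, r, Phi, gamma=0.9, alpha=0.002, steps=100_000, seed=0):
    """Plain TD(0) policy evaluation with linear features and a constant stepsize.

    P:   (n, n) transition matrix of the Markov chain induced by the policy
    r:   (n,)   expected reward received in each state
    Phi: (n, d) feature matrix, one row of features per state
    """
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    theta = np.zeros(Phi.shape[1])
    s = 0
    for _ in range(steps):
        s_next = rng.choice(n_states, p=P[s])
        # Temporal-difference error for the observed transition s -> s_next.
        td_error = r[s] + gamma * (Phi[s_next] @ theta) - Phi[s] @ theta
        # Semi-gradient update along the features of the current state.
        theta = theta + alpha * td_error * Phi[s]
        s = s_next
    return theta

# Hypothetical two-state example with tabular (identity) features.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
Phi = np.eye(2)

theta = td0_linear(P, r, Phi)
# Exact values from the Bellman equation v = r + gamma * P v, for comparison.
v_exact = np.linalg.solve(np.eye(2) - 0.9 * P, r)
```

With a constant `alpha` the estimate keeps fluctuating around the exact values, and a larger `alpha` trades faster initial progress for larger fluctuations; this is the stepsize sensitivity that motivates adaptive variants.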

[47]  arXiv:2002.08538 (cross-list from cs.LG) [pdf, other]
Title: Non-asymptotic and Accurate Learning of Nonlinear Dynamical Systems
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Applications (stat.AP); Machine Learning (stat.ML)

We consider the problem of learning stabilizable systems governed by a nonlinear state equation $h_{t+1}=\phi(h_t,u_t;\theta)+w_t$. Here $\theta$ is the unknown system dynamics, $h_t$ is the state, $u_t$ is the input and $w_t$ is the additive noise vector. We study gradient-based algorithms to learn the system dynamics $\theta$ from samples obtained from a single finite trajectory. If the system is run by a stabilizing input policy, we show that temporally dependent samples can be approximated by i.i.d. samples via a truncation argument by using mixing-time arguments. We then develop new guarantees for the uniform convergence of the gradients of the empirical loss. Unlike existing work, our bounds are noise sensitive, which allows for learning ground-truth dynamics with high accuracy and small sample complexity. Together, our results facilitate efficient learning of the general nonlinear system under a stabilizing policy. We specialize our guarantees to entry-wise nonlinear activations and verify our theory in various numerical experiments.

[48]  arXiv:2002.08567 (cross-list from cs.LG) [pdf, other]
Title: Multi-Agent Meta-Reinforcement Learning for Self-Powered and Sustainable Edge Computing Systems
Comments: Submitted to IEEE Transactions on Network and Service Management
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Signal Processing (eess.SP); Machine Learning (stat.ML)

The stringent requirements of mobile edge computing (MEC) applications and functions call for high-capacity, densely deployed MEC hosts in upcoming wireless networks. However, operating such high-capacity MEC hosts can significantly increase energy consumption. Thus, a base station (BS) unit can act as a self-powered BS. In this paper, an effective energy dispatch mechanism for self-powered wireless networks with edge computing capabilities is studied. First, a two-stage linear stochastic programming problem is formulated with the goal of minimizing the total energy consumption cost of the system while fulfilling the energy demand. Second, a semi-distributed data-driven solution is proposed by developing a novel multi-agent meta-reinforcement learning (MAMRL) framework to solve the formulated problem. In particular, each BS plays the role of a local agent that explores a Markovian behavior for both energy consumption and generation while each BS transfers time-varying features to a meta-agent. Subsequently, the meta-agent optimizes (i.e., exploits) the energy dispatch decision by accepting only the observations from each local agent with its own state information. Meanwhile, each BS agent estimates its own energy dispatch policy by applying the learned parameters from the meta-agent. Finally, the proposed MAMRL framework is benchmarked by analyzing deterministic, asymmetric, and stochastic environments in terms of non-renewable energy usage, energy cost, and accuracy. Experimental results show that the proposed MAMRL model can reduce non-renewable energy usage by up to 11% and the energy cost by 22.4% (with 95.8% prediction accuracy), compared to other baseline methods.

[49]  arXiv:2002.08570 (cross-list from cs.LG) [pdf, other]
Title: Input Perturbation: A New Paradigm between Central and Local Differential Privacy
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Traditionally, there are two models of differential privacy: the central model and the local model. The central model focuses on the machine learning model and the local model focuses on the training data. In this paper, we study the input perturbation method in differentially private empirical risk minimization (DP-ERM), preserving privacy of the central model. By adding noise to the original training data and training with the 'perturbed data', we achieve ($\epsilon$,$\delta$)-differential privacy on the final model, along with some kind of privacy on the original data. We observe that there is an interesting connection between the local model and the central model: the perturbation of the original data causes a perturbation of the gradient, and finally of the model parameters. This observation means that our method builds a bridge between the local and central models, protecting the data, the gradient and the model simultaneously, which is superior to previous central methods. Detailed theoretical analysis and experiments show that our method achieves almost the same (or even better) performance as some of the best previous central methods with more protections on privacy, which is an attractive result. Moreover, we extend our method to a more general case: the loss function satisfies the Polyak-Lojasiewicz condition, which is more general than strong convexity, the constraint on the loss function in most previous work.

[50]  arXiv:2002.08578 (cross-list from cs.LG) [pdf, other]
Title: Differentially Private ERM Based on Data Perturbation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In this paper, after observing that different training data instances affect the machine learning model to different extents, we attempt to improve the performance of differentially private empirical risk minimization (DP-ERM) from a new perspective. Specifically, we measure the contributions of various training data instances to the final machine learning model, and select some of them to which random noise is added. Considering that the key of our method is to measure each data instance separately, we propose a new 'Data perturbation' based (DB) paradigm for DP-ERM: adding random noise to the original training data and achieving ($\epsilon,\delta$)-differential privacy on the final machine learning model, along with the preservation of the original data. By introducing the Influence Function (IF), we quantitatively measure the impact of the training data on the final model. Theoretical and experimental results show that our proposed DBDP-ERM paradigm enhances the model performance significantly.

[51]  arXiv:2002.08583 (cross-list from cs.LG) [pdf, other]
Title: Regret Minimization in Stochastic Contextual Dueling Bandits
Comments: 28 pages, 11 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We consider the problem of stochastic $K$-armed dueling bandit in the contextual setting, where at each round the learner is presented with a context set of $K$ items, each represented by a $d$-dimensional feature vector, and the goal of the learner is to identify the best arm of each context set. However, unlike the classical contextual bandit setup, our framework only allows the learner to receive item feedback in terms of their (noisy) pairwise preferences, famously studied as dueling bandits, which is of practical interest in various online decision making scenarios, e.g. recommender systems, information retrieval, tournament ranking, where it is easier to elicit the relative strength of the items instead of their absolute scores. However, to the best of our knowledge this work is the first to consider the problem of regret minimization of contextual dueling bandits for potentially infinite decision spaces and gives provably optimal algorithms along with a matching lower bound analysis. We present two algorithms for the setup with respective regret guarantees $\tilde O(d\sqrt{T})$ and $\tilde O(\sqrt{dT \log K})$. Subsequently we also show that $\Omega(\sqrt {dT})$ is actually the fundamental performance limit for this problem, implying the optimality of our second algorithm. However the analysis of our first algorithm is comparatively simpler, and it is often shown to outperform the former empirically. Finally, we corroborate all the theoretical results with suitable experiments.

[52]  arXiv:2002.08595 (cross-list from cs.CV) [pdf, other]
Title: KaoKore: A Pre-modern Japanese Art Facial Expression Dataset
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

From classifying handwritten digits to generating strings of text, the datasets which have received long-time focus from the machine learning community vary greatly in their subject matter. This has motivated a renewed interest in building datasets which are socially and culturally relevant, so that algorithmic research may have a more direct and immediate impact on society. One such area is in history and the humanities, where better and relevant machine learning models can accelerate research across various fields. To this end, newly released benchmarks and models have been proposed for transcribing historical Japanese cursive writing, yet for the field as a whole using machine learning for historical Japanese artworks still remains largely uncharted. To bridge this gap, in this work we propose a new dataset KaoKore which consists of faces extracted from pre-modern Japanese artwork. We demonstrate its value as both a dataset for image classification as well as a creative and artistic dataset, which we explore using generative models. Dataset available at https://github.com/rois-codh/kaokore

[53]  arXiv:2002.08596 (cross-list from cs.LG) [pdf]
Title: Interpretability of machine learning based prediction models in healthcare
Comments: 12 pages, 2 figures, submitted to Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

There is a need to ensure that machine learning models are interpretable. Higher interpretability of the model means easier comprehension and explanation of future predictions for end-users. Further, interpretable machine learning models allow healthcare experts to make reasonable and data-driven decisions to provide personalized decisions that can ultimately lead to higher quality of service in healthcare. Generally, we can classify interpretability approaches into two groups, where the first focuses on personalized interpretation (local interpretability) while the second summarizes prediction models on a population level (global interpretability). Alternatively, we can group interpretability methods into model-specific techniques, which are designed to interpret predictions generated by a specific model, such as a neural network, and model-agnostic approaches, which provide easy-to-understand explanations of predictions made by any machine learning model. Here, we give an overview of interpretability approaches and provide examples of practical interpretability of machine learning in different areas of healthcare, including prediction of health-related outcomes, optimizing treatments or improving the efficiency of screening for specific conditions. Further, we outline future directions for interpretable machine learning and highlight the importance of developing algorithmic solutions that can enable machine-learning driven decision making in high-stakes healthcare problems.

[54]  arXiv:2002.08597 (cross-list from eess.SP) [pdf, ps, other]
Title: Kalman Filtering With Censored Measurements
Comments: 14 pages, 3 figures
Subjects: Signal Processing (eess.SP); Methodology (stat.ME)

This paper concerns Kalman filtering when the measurements of the process are censored. The censored measurements are addressed by the Tobit model of Type I and are one-dimensional with two censoring limits, while the (hidden) state vectors are multidimensional. For this model, Bayesian estimates for the state vectors are provided through a recursive algorithm of Kalman filtering type. Experiments are presented to illustrate the effectiveness and applicability of the algorithm. The experiments show that the proposed method outperforms other filtering methodologies in minimizing the computational cost as well as the overall Root Mean Square Error (RMSE) for synthetic and real data sets.

[55]  arXiv:2002.08599 (cross-list from cs.LG) [pdf, other]
Title: On Learning Sets of Symmetric Elements
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Learning from unordered sets is a fundamental learning setup, which is attracting increasing attention. Research in this area has focused on the case where elements of the set are represented by feature vectors, and far less emphasis has been given to the common case where set elements themselves adhere to certain symmetries. That case is relevant to numerous applications, from deblurring image bursts to multi-view 3D shape recognition and reconstruction.
In this paper, we present a principled approach to learning sets of general symmetric elements. We first characterize the space of linear layers that are equivariant both to element reordering and to the inherent symmetries of elements, like translation in the case of images. We further show that networks that are composed of these layers, called Deep Sets for Symmetric elements layers (DSS), are universal approximators of both invariant and equivariant functions. DSS layers are also straightforward to implement. Finally, we show that they improve over existing set-learning architectures in a series of experiments with images, graphs, and point-clouds.

[56]  arXiv:2002.08605 (cross-list from cs.LG) [pdf, other]
Title: Optimizing Black-box Metrics with Adaptive Surrogates
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

We address the problem of training models with black-box and hard-to-optimize metrics by expressing the metric as a monotonic function of a small number of easy-to-optimize surrogates. We pose the training problem as an optimization over a relaxed surrogate space, which we solve by estimating local gradients for the metric and performing inexact convex projections. We analyze gradient estimates based on finite differences and local linear interpolations, and show convergence of our approach under smoothness assumptions with respect to the surrogates. Experimental results on classification and ranking problems verify the proposal performs on par with methods that know the mathematical formulation, and adds notable value when the form of the metric is unknown.

[57]  arXiv:2002.08616 (cross-list from cs.LG) [pdf, other]
Title: Diversity sampling is an implicit regularization for kernel methods
Comments: 27 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Kernel methods have achieved very good performance on large scale regression and classification problems, by using the Nystr\"om method and preconditioning techniques. The Nystr\"om approximation -- based on a subset of landmarks -- gives a low rank approximation of the kernel matrix, and is known to provide a form of implicit regularization. We further elaborate on the impact of sampling diverse landmarks for constructing the Nystr\"om approximation in supervised as well as unsupervised kernel methods. By using Determinantal Point Processes for sampling, we obtain additional theoretical results concerning the interplay between diversity and regularization. Empirically, we demonstrate the advantages of training kernel methods based on subsets made of diverse points. In particular, if the dataset has a dense bulk and a sparser tail, we show that Nystr\"om kernel regression with diverse landmarks increases the accuracy of the regression in sparser regions of the dataset, with respect to a uniform landmark sampling. A greedy heuristic is also proposed to select diverse samples of significant size within large datasets when exact DPP sampling is not practically feasible.

[58]  arXiv:2002.08619 (cross-list from cs.LG) [pdf, other]
Title: Boosting Adversarial Training with Hypersphere Embedding
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Adversarial training (AT) is one of the most effective defenses to improve the adversarial robustness of deep learning models. In order to promote the reliability of the adversarially trained models, we propose to boost AT via incorporating hypersphere embedding (HE), which can regularize the adversarial features onto compact hypersphere manifolds. We formally demonstrate that AT and HE are well coupled, which tunes up the learning dynamics of AT from several aspects. We comprehensively validate the effectiveness and universality of HE by embedding it into the popular AT frameworks including PGD-AT, ALP, and TRADES, as well as the FreeAT and FastAT strategies. In experiments, we evaluate our methods on the CIFAR-10 and ImageNet datasets, and verify that integrating HE can consistently enhance the performance of the models trained by each AT framework with little extra computation.

[59]  arXiv:2002.08621 (cross-list from cs.LG) [pdf, other]
Title: The Benefits of Pairwise Discriminators for Adversarial Training
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Adversarial training methods typically align distributions by solving two-player games. However, in most current formulations, even if the generator aligns perfectly with data, a sub-optimal discriminator can still drive the two apart. Absent additional regularization, the instability can manifest itself as a never-ending game. In this paper, we introduce a family of objectives by leveraging pairwise discriminators, and show that only the generator needs to converge. The alignment, if achieved, would be preserved with any discriminator. We provide sufficient conditions for local convergence; characterize the capacity balance that should guide the discriminator and generator choices; and construct examples of minimally sufficient discriminators. Empirically, we illustrate the theory and the effectiveness of our approach on synthetic examples. Moreover, we show that practical methods derived from our approach can better generate higher-resolution images.

[60]  arXiv:2002.08641 (cross-list from cs.LG) [pdf]
Title: A Novel Framework for Selection of GANs for an Application
Comments: 23 pages, 1 figure, 7 tables
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Generative Adversarial Networks (GANs) are a current focal point of research. The body of knowledge is fragmented, leading to a trial-and-error method while selecting an appropriate GAN for a given scenario. We provide a comprehensive summary of the evolution of GANs starting from their inception, addressing issues like mode collapse, vanishing gradient, unstable training and non-convergence. We also provide a comparison of various GANs from the application point of view, their behaviour and implementation details. We propose a novel framework to identify candidate GANs for a specific use case based on architecture, loss, regularization and divergence. We also discuss application of the framework using an example, and we demonstrate a significant reduction in search space. This efficient way to determine potential GANs lowers unit economics of AI development for organizations.

[61]  arXiv:2002.08643 (cross-list from cs.LG) [pdf, other]
Title: Embedding Graph Auto-Encoder with Joint Clustering via Adjacency Sharing
Comments: 11 pages containing appendix
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Graph convolution networks have attracted much attention, and several graph auto-encoder based clustering models have been developed for attributed graph clustering. However, most existing approaches separate clustering and optimization of the graph auto-encoder into two individual steps. In this paper, we propose a graph convolution network based clustering model, namely, Embedding Graph Auto-Encoder with JOint Clustering via Adjacency Sharing (\textit{EGAE-JOCAS}). As for the embedded model, we develop a novel joint clustering method, which combines relaxed k-means and spectral clustering and is applicable for the learned embedding. The proposed joint clustering shares the same adjacency within graph convolution layers. Two parts are optimized simultaneously through performing SGD and taking closed-form solutions alternately to ensure a rapid convergence. Moreover, our model is free to incorporate any mechanisms (e.g., attention) into the graph auto-encoder. Extensive experiments are conducted to prove the superiority of EGAE-JOCAS. Sufficient theoretical analyses are provided to support the results.

[62]  arXiv:2002.08645 (cross-list from cs.LG) [pdf, other]
Title: Uncovering Coresets for Classification With Multi-Objective Evolutionary Algorithms
Comments: 9 pages, 3 figures, conference. Submitted to ICML 2020
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

A coreset is a subset of the training set, using which a machine learning algorithm obtains performances similar to what it would deliver if trained over the whole original data. Coreset discovery is an active and open line of research as it allows improving training speed for the algorithms and may help humans understand the results. Building on previous works, a novel approach is presented: candidate coresets are iteratively optimized, adding and removing samples. As there is an obvious trade-off between limiting training size and quality of the results, a multi-objective evolutionary algorithm is used to minimize simultaneously the number of points in the set and the classification error. Experimental results on non-trivial benchmarks show that the proposed approach is able to deliver results that allow a classifier to obtain lower error and better ability of generalizing on unseen data than state-of-the-art coreset discovery techniques.

[63]  arXiv:2002.08648 (cross-list from cs.LG) [pdf, other]
Title: Adaptive Graph Auto-Encoder for General Data Clustering
Comments: 11 pages containing one page supplementary
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Graph based clustering plays an important role in the clustering area. Recent studies about graph convolution neural networks have achieved impressive success on graph type data. However, in traditional clustering tasks, the graph structure of data does not exist, such that the strategy to construct the graph is crucial for performance. In addition, the existing graph auto-encoder based approaches perform poorly on weighted graphs, which are widely used in graph based clustering. In this paper, we propose a graph auto-encoder with local structure preserving for general data clustering, which can update the constructed graph adaptively. The adaptive process is designed to utilize the non-Euclidean structure sufficiently. By combining a generative model for graph embedding and graph based clustering, a graph auto-encoder with a novel decoder is developed and it performs well in scenarios that use weighted graphs. Extensive experiments prove the superiority of our model.

[64]  arXiv:2002.08665 (cross-list from cs.LG) [pdf, other]
Title: Computationally Tractable Riemannian Manifolds for Graph Embeddings
Comments: Submitted to International Conference on Machine Learning (ICML) 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Representing graphs as sets of node embeddings in certain curved Riemannian manifolds has recently gained momentum in machine learning due to their desirable geometric inductive biases, e.g., hierarchical structures benefit from hyperbolic geometry. However, going beyond embedding spaces of constant sectional curvature, while potentially more representationally powerful, proves to be challenging as one can easily lose the appeal of computationally tractable tools such as geodesic distances or Riemannian gradients. Here, we explore computationally efficient matrix manifolds, showcasing how to learn and optimize graph embeddings in these Riemannian spaces. Empirically, we demonstrate consistent improvements over Euclidean geometry while often outperforming hyperbolic and elliptical embeddings based on various metrics that capture different graph properties. Our results serve as new evidence for the benefits of non-Euclidean embeddings in machine learning pipelines.

[65]  arXiv:2002.08675 (cross-list from cs.LG) [pdf, other]
Title: Unsupervised Domain Adaptation via Discriminative Manifold Embedding and Alignment
Comments: Accepted to AAAI 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Unsupervised domain adaptation is effective in leveraging the rich information from the source domain to the unsupervised target domain. Though deep learning and adversarial strategy make an important breakthrough in the adaptability of features, there are two issues to be further explored. First, the hard-assigned pseudo labels on the target domain are risky to the intrinsic data structure. Second, the batch-wise training manner in deep learning limits the description of the global structure. In this paper, a Riemannian manifold learning framework is proposed to achieve transferability and discriminability consistently. As to the first problem, this method establishes a probabilistic discriminant criterion on the target domain via soft labels. Further, this criterion is extended to a global approximation scheme for the second issue; such approximation is also memory-saving. The manifold metric alignment is exploited to be compatible with the embedding space. A theoretical error bound is derived to facilitate the alignment. Extensive experiments have been conducted to investigate the proposal, and results of the comparison study demonstrate the superiority of the consistent manifold learning framework.

[66]  arXiv:2002.08676 (cross-list from cs.LG) [pdf, other]
Title: Learning with Differentiable Perturbed Optimizers
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Machine learning pipelines often rely on optimization procedures to make discrete decisions (e.g. sorting, picking closest neighbors, finding shortest paths or optimal matchings). Although these discrete decisions are easily computed in a forward manner, they cannot be used to modify model parameters using first-order optimization techniques because they break the back-propagation of computational graphs. In order to expand the scope of learning problems that can be solved in an end-to-end fashion, we propose a systematic method to transform a block that outputs an optimal discrete decision into a differentiable operation. Our approach relies on stochastic perturbations of these parameters, and can be used readily within existing solvers without the need for ad hoc regularization or smoothing. These perturbed optimizers yield solutions that are differentiable and never locally constant. The amount of smoothness can be tuned via the chosen noise amplitude, whose impact we analyze. The derivatives of these perturbed solvers can be evaluated efficiently. We also show how this framework can be connected to a family of losses developed in structured prediction, and describe how these can be used in unsupervised and supervised learning, with theoretical guarantees. We demonstrate the performance of our approach on several machine learning tasks in experiments on synthetic and real data.
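The perturbation idea in the abstract above can be illustrated on the simplest discrete decision, an argmax. The following is a minimal sketch and not the authors' implementation: the function names, the choice of Gaussian noise, and the Monte Carlo averaging are assumptions made for illustration. It averages the one-hot argmax over noisy copies of the scores, so the output varies smoothly with the scores, and the (hypothetical) `sigma` parameter plays the role of the noise amplitude.

```python
import random


def argmax_one_hot(scores):
    """Hard decision: one-hot vector marking the largest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


def perturbed_argmax(scores, sigma=1.0, num_samples=1000, seed=0):
    """Monte Carlo estimate of E[argmax_one_hot(scores + sigma * Z)].

    Z is standard Gaussian noise (an assumption for this sketch).
    Larger sigma gives a smoother output; sigma near 0 recovers the
    hard one-hot decision.
    """
    rng = random.Random(seed)
    total = [0.0] * len(scores)
    for _ in range(num_samples):
        noisy = [s + sigma * rng.gauss(0.0, 1.0) for s in scores]
        for i, v in enumerate(argmax_one_hot(noisy)):
            total[i] += v
    return [t / num_samples for t in total]


if __name__ == "__main__":
    print(perturbed_argmax([1.0, 0.0, -1.0], sigma=1.0))
    print(perturbed_argmax([1.0, 0.0, -1.0], sigma=1e-6))
```

With a very small `sigma` the result coincides with the hard argmax, while a larger `sigma` spreads mass over the other entries.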

[67]  arXiv:2002.08681 (cross-list from cs.LG) [pdf, other]
Title: Unsupervised Multi-Class Domain Adaptation: Theory, Algorithms, and Practice
Comments: The journal manuscript extended significantly from our preliminary CVPR conference paper. Codes are available at: this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

In this paper, we study the formalism of unsupervised multi-class domain adaptation (multi-class UDA), which underlies some recent algorithms whose learning objectives are only motivated empirically. A Multi-Class Scoring Disagreement (MCSD) divergence is presented by aggregating the absolute margin violations in multi-class classification; the proposed MCSD is able to fully characterize the relations between any pair of multi-class scoring hypotheses. By using MCSD as a measure of domain distance, we develop a new domain adaptation bound for multi-class UDA as well as its data-dependent, probably approximately correct bound, which naturally suggest adversarial learning objectives to align conditional feature distributions across the source and target domains. Consequently, an algorithmic framework of Multi-class Domain-adversarial learning Networks (McDalNets) is developed, whose different instantiations via surrogate learning objectives either coincide with or resemble a few recently popular methods, thus (partially) underscoring their practical effectiveness. Based on the same theory of multi-class UDA, we also introduce a new algorithm of Domain-Symmetric Networks (SymmNets), which features a novel adversarial strategy of domain confusion and discrimination. SymmNets afford simple extensions that work equally well under the problem settings of either closed set, partial, or open set UDA. We conduct careful empirical studies to compare different algorithms of McDalNets and our newly introduced SymmNets. Experiments verify our theoretical analysis and show the efficacy of our proposed SymmNets. We make our implementation codes publicly available.

[68]  arXiv:2002.08695 (cross-list from cs.LG) [pdf, other]
Title: Stochastic Optimization for Regularized Wasserstein Estimators
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Optimal transport is a foundational problem in optimization that allows comparing probability distributions while taking into account geometric aspects. Its optimal objective value, the Wasserstein distance, provides an important loss between distributions that has been used in many applications throughout machine learning and statistics. Recent algorithmic progress on this problem and its regularized versions has made these tools increasingly popular. However, existing techniques require solving an optimization problem to obtain a single gradient of the loss, thus slowing down first-order methods to minimize the sum of losses, which require many such gradient computations. In this work, we introduce an algorithm to solve a regularized version of this problem of Wasserstein estimators, with a time per step which is sublinear in the natural dimensions of the problem. We introduce a dual formulation, and optimize it with stochastic gradient steps that can be computed directly from samples, without solving additional optimization problems at each step. Doing so, the estimation and computation tasks are performed jointly. We show that this algorithm can be extended to other tasks, including estimation of Wasserstein barycenters. We provide theoretical guarantees and illustrate the performance of our algorithm with experiments on synthetic data.

[69]  arXiv:2002.08697 (cross-list from cs.LG) [pdf, other]
Title: Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs
Comments: A copy of this was published in IISWC'19
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Convolutional Neural Networks (CNN) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, many times just by porting large models designed for server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations is channel pruning. Mobile and embedded systems now have GPUs which are ideal for the parallel computations of neural networks and for their lower energy cost per operation. Specialized libraries perform these neural network computations through highly optimized routines. As we find in our experiments, these libraries are optimized for the most common network shapes, making uninstructed channel pruning inefficient. We evaluate higher level libraries, which analyze the input characteristics of a convolutional layer, based on which they produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. However, in reality, these characteristics and subsequent choices intended for optimization can have the opposite effect. We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to 2x slowdown. On the other hand, we also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3x with cuDNN and above 10x with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning.

[70]  arXiv:2002.08709 (cross-list from cs.LG) [pdf, other]
Title: Do We Need Zero Training Loss After Achieving Zero Training Error?
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Overparameterized deep networks have the capacity to memorize training data with zero training error. Even after memorization, the training loss continues to approach zero, making the model overconfident and the test performance degraded. Since existing regularizers do not directly aim to avoid zero training loss, they often fail to maintain a moderate level of training loss, ending up with a too small or too large loss. We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value, which we call the flooding level. Our approach makes the loss float around the flooding level by doing mini-batched gradient descent as usual but gradient ascent if the training loss is below the flooding level. This can be implemented with one line of code, and is compatible with any stochastic optimizer and other regularizers. With flooding, the model will continue to "random walk" with the same non-zero training loss, and we expect it to drift into an area with a flat loss landscape that leads to better generalization. We experimentally show that flooding improves performance and as a byproduct, induces a double descent curve of the test loss.
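The descent-above, ascent-below rule described in the abstract above can be sketched as follows. This is an illustrative toy, not the authors' code: the function names and the one-parameter quadratic loss are assumptions, and the expression `abs(loss - b) + b` is one way to write a loss whose gradient flips sign below the flooding level `b`.

```python
def flooded_loss(loss: float, flood_level: float) -> float:
    """Return |loss - b| + b: equal to loss above b, mirrored around b below it."""
    return abs(loss - flood_level) + flood_level


def flooded_grad_sign(loss: float, flood_level: float) -> int:
    """Factor applied to the original gradient: +1 descent, -1 ascent."""
    if loss > flood_level:
        return 1
    if loss < flood_level:
        return -1
    return 0


def train_with_flooding(w: float, flood_level: float, lr: float = 0.1, steps: int = 50):
    """Toy problem (assumed for illustration): minimise loss(w) = w**2."""
    losses = []
    for _ in range(steps):
        loss = w * w
        grad = 2.0 * w
        # Descend when the loss is above the flooding level, ascend when below.
        w -= lr * flooded_grad_sign(loss, flood_level) * grad
        losses.append(loss)
    return w, losses


if __name__ == "__main__":
    _, history = train_with_flooding(w=1.0, flood_level=0.25)
    print(history[-5:])
```

In this toy run the loss stops decreasing toward zero and instead floats around the flooding level.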

[71]  arXiv:2002.08717 (cross-list from math.OC) [pdf, ps, other]
Title: The Directional Optimal Transport
Comments: 30 pages, 5 figures
Subjects: Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)

We introduce a constrained optimal transport problem where origins $x$ can only be transported to destinations $y\geq x$. Our statistical motivation is to describe the sharp upper bound for the variance of the treatment effect $Y-X$ given marginals when the effect is monotone, or $Y\geq X$. We thus focus on supermodular costs (or submodular rewards) and introduce a coupling $P_{*}$ that is optimal for all such costs and yields the sharp bound. This coupling admits manifold characterizations---geometric, order-theoretic, as optimal transport, through the cdf, and via the transport kernel---that explain its structure and imply useful bounds. When the first marginal is atomless, $P_{*}$ is concentrated on the graphs of two maps which can be described in terms of the marginals, the second map arising due to the binding constraint.

[72]  arXiv:2002.08740 (cross-list from cs.LG) [pdf, other]
Title: Towards Certifiable Adversarial Sample Detection
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)

Convolutional Neural Networks (CNNs) are deployed in more and more classification systems, but adversarial samples can be maliciously crafted to trick them, and are becoming a real threat. There have been various proposals to improve CNNs' adversarial robustness but these all suffer performance penalties or other limitations. In this paper, we provide a new approach in the form of a certifiable adversarial detection scheme, the Certifiable Taboo Trap (CTT). The system can provide certifiable guarantees of detection of adversarial inputs for certain $l_{\infty}$ sizes on a reasonable assumption, namely that the training data have the same distribution as the test data. We develop and evaluate several versions of CTT with a range of defense capabilities, training overheads and certifiability on adversarial samples. Against adversaries with various $l_p$ norms, CTT outperforms existing defense methods that focus purely on improving network robustness. We show that CTT has small false positive rates on clean test data, minimal compute overheads when deployed, and can support complex security policies.

[73]  arXiv:2002.08762 (cross-list from cs.LG) [pdf, other]
Title: Error detection in Knowledge Graphs: Path Ranking, Embeddings or both?
Comments: 19 pages, 3 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)

This paper attempts to compare and combine different approaches for detecting errors in Knowledge Graphs. Knowledge Graphs constitute a mainstream approach for the representation of relational information on big heterogeneous data, however, they may contain a big amount of imputed noise when constructed automatically. To address this problem, different error detection methodologies have been proposed, mainly focusing on path ranking and representation learning. This work presents various mainstream approaches and proposes a novel hybrid and modular methodology for the task. We compare these methods on two benchmarks and one real-world biomedical publications dataset, showcasing the potential of our approach and drawing insights regarding the state-of-art in error detection in Knowledge Graphs.

[74]  arXiv:2002.08772 (cross-list from cs.LG) [pdf, other]
Title: Set2Graph: Learning Graphs From Sets
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Many problems in machine learning (ML) can be cast as learning functions from sets to graphs, or more generally to hypergraphs; in short, Set2Graph functions. Examples include clustering, learning vertex and edge features on graphs, and learning triplet data in a collection. Current neural network models that approximate Set2Graph functions come from two main ML sub-fields: equivariant learning, and similarity learning. Equivariant models would be in general computationally challenging or even infeasible, while similarity learning models can be shown to have limited expressive power. In this paper we suggest a neural network model family for learning Set2Graph functions that is both practical and of maximal expressive power (universal), that is, can approximate arbitrary continuous Set2Graph functions over compact sets. Testing our models on different machine learning tasks, including an application to particle physics, we find them favorable to existing baselines.

[75]  arXiv:2002.08782 (cross-list from cs.LG) [pdf, other]
Title: Dynamic Federated Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Federated learning has emerged as an umbrella term for centralized coordination strategies in multi-agent environments. While many federated learning architectures process data in an online manner, and are hence adaptive by nature, most performance analyses assume static optimization problems and offer no guarantees in the presence of drifts in the problem solution or data characteristics. We consider a federated learning model where at every iteration, a random subset of available agents perform local updates based on their data. Under a non-stationary random walk model on the true minimizer for the aggregate optimization problem, we establish that the performance of the architecture is determined by three factors, namely, the data variability at each agent, the model variability across all agents, and a tracking term that is inversely proportional to the learning rate of the algorithm. The results clarify the trade-off between convergence and tracking performance.

[76]  arXiv:2002.08791 (cross-list from cs.LG) [pdf, other]
Title: Bayesian Deep Learning and a Probabilistic Perspective of Generalization
Comments: 27 pages, 17 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

The key distinguishing property of a Bayesian approach is marginalization, rather than using a single setting of weights. Bayesian marginalization can particularly improve the accuracy and calibration of modern deep neural networks, which are typically underspecified by the data, and can represent many compelling but different solutions. We show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization, and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction, without significant overhead. We also investigate the prior over functions implied by a vague distribution over neural network weights, explaining the generalization properties of such models from a probabilistic perspective. From this perspective, we explain results that have been presented as mysterious and distinct to neural network generalization, such as the ability to fit images with random labels, and show that these results can be reproduced with Gaussian processes. Finally, we provide a Bayesian perspective on tempering for calibrating predictive distributions.

[77]  arXiv:2002.08799 (cross-list from cs.LG) [pdf, other]
Title: A Structured Prediction Approach for Conditional Meta-Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Optimization-based meta-learning algorithms are a powerful class of methods for learning-to-learn applications such as few-shot learning. They tackle the limited availability of training data by leveraging the experience gained from previously observed tasks. However, when the complexity of the tasks distribution cannot be captured by a single set of shared meta-parameters, existing methods may fail to fully adapt to a target task. We address this issue with a novel perspective on conditional meta-learning based on structured prediction. We propose task-adaptive structured meta-learning (TASML), a principled estimator that weighs meta-training data conditioned on the target task to design tailored meta-learning objectives. In addition, we introduce algorithmic improvements to tackle key computational limitations of existing methods. Experimentally, we show that TASML outperforms state-of-the-art methods on benchmark datasets both in terms of accuracy and efficiency. An ablation study quantifies the individual contribution of model components and suggests useful practices for meta-learning.

[78]  arXiv:2002.08803 (cross-list from cs.LG) [pdf, other]
Title: Support-weighted Adversarial Imitation Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Adversarial Imitation Learning (AIL) is a broad family of imitation learning methods designed to mimic expert behaviors from demonstrations. While AIL has shown state-of-the-art performance on imitation learning with only a small number of demonstrations, it faces several practical challenges such as potential training instability and implicit reward bias. To address these challenges, we propose Support-weighted Adversarial Imitation Learning (SAIL), a general framework that extends a given AIL algorithm with information derived from support estimation of the expert policies. SAIL improves the quality of the reinforcement signals by weighing the adversarial reward with a confidence score from support estimation of the expert policy. We also show that SAIL is always at least as efficient as the underlying AIL algorithm that SAIL uses for learning the adversarial reward. Empirically, we show that the proposed method achieves better performance and training stability than baseline methods on a wide range of benchmark control tasks.

[79]  arXiv:2002.08831 (cross-list from math.NA) [pdf, ps, other]
Title: Efficiently updating a covariance matrix and its LDL decomposition
Subjects: Numerical Analysis (math.NA); Computation (stat.CO)

Equations are presented which efficiently update or downdate the covariance matrix of a large number of $m$-dimensional observations. Updates and downdates to the covariance matrix, as well as mixed updates/downdates, are shown to be rank-$k$ modifications, where $k$ is the number of new observations added plus the number of old observations removed. As a result, the update and downdate equations decrease the required number of multiplications for a modification to $\Theta((k+1)m^2)$ instead of $\Theta((n+k+1)m^2)$ or $\Theta((n-k+1)m^2)$, where $n$ is the number of initial observations. Having the rank-$k$ formulas for the updates also allows a number of other known identities to be applied, providing a way of applying updates and downdates directly to the inverse and decompositions of the covariance matrix. To illustrate, we provide an efficient algorithm for applying the rank-$k$ update to the LDL decomposition of a covariance matrix.
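
As a rough illustration of why adding or removing an observation is a low-rank modification, the sketch below uses the textbook rank-1 (Welford-style) identities for the mean and the scatter matrix $S = \sum_i (x_i - \mu)(x_i - \mu)^T$. It is not taken from the paper: the function names (`batch_stats`, `add_observation`, `remove_observation`, `covariance`) are hypothetical, and the paper's rank-$k$ equations and LDL algorithm are not reproduced here. Applying the rank-1 step $k$ times costs on the order of $k m^2$ multiplications, compared with recomputing over all observations.

```python
def batch_stats(data):
    """Reference computation from scratch: count, mean, and scatter matrix."""
    n, m = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(m)]
    scatter = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in data)
                for j in range(m)] for i in range(m)]
    return n, mean, scatter


def add_observation(n, mean, scatter, x):
    """Rank-1 update: S' = S + n/(n+1) * d d^T with d = x - mean."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    w = n / (n + 1)
    new_mean = [mi + di / (n + 1) for mi, di in zip(mean, d)]
    new_scatter = [[s + w * di * dj for s, dj in zip(row, d)]
                   for row, di in zip(scatter, d)]
    return n + 1, new_mean, new_scatter


def remove_observation(n, mean, scatter, x):
    """Rank-1 downdate: S' = S - n/(n-1) * d d^T with d = x - mean."""
    if n < 2:
        raise ValueError("need at least two observations to remove one")
    d = [xi - mi for xi, mi in zip(x, mean)]
    w = n / (n - 1)
    new_mean = [mi - di / (n - 1) for mi, di in zip(mean, d)]
    new_scatter = [[s - w * di * dj for s, dj in zip(row, d)]
                   for row, di in zip(scatter, d)]
    return n - 1, new_mean, new_scatter


def covariance(n, scatter):
    """Unbiased sample covariance from the scatter matrix (requires n >= 2)."""
    return [[s / (n - 1) for s in row] for row in scatter]
```

Each update touches every entry of the $m \times m$ scatter matrix once, which is where the $m^2$ factor comes from; the number of initial observations $n$ enters only through the scalar weight.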

[80]  arXiv:2002.08837 (cross-list from cs.LG) [pdf, other]
Title: No-Regret and Incentive-Compatible Online Learning
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)

We study online learning settings in which experts act strategically to maximize their influence on the learning algorithm's predictions by potentially misreporting their beliefs about a sequence of binary events. Our goal is twofold. First, we want the learning algorithm to be no-regret with respect to the best fixed expert in hindsight. Second, we want incentive compatibility, a guarantee that each expert's best strategy is to report his true beliefs about the realization of each event. To achieve this goal, we build on the literature on wagering mechanisms, a type of multi-agent scoring rule. We provide algorithms that achieve no regret and incentive compatibility for myopic experts for both the full and partial information settings. In experiments on datasets from FiveThirtyEight, our algorithms have regret comparable to classic no-regret algorithms, which are not incentive-compatible. Finally, we identify an incentive-compatible algorithm for forward-looking strategic agents that exhibits diminishing regret in practice.

[81]  arXiv:2002.08838 (cross-list from cs.LG) [pdf, other]
Title: On the Decision Boundaries of Deep Neural Networks: A Tropical Geometry Perspective
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

This work tackles the problem of characterizing and understanding the decision boundaries of neural networks with piecewise linear non-linearity activations. We use tropical geometry, a new development in the area of algebraic geometry, to characterize the decision boundaries of a simple neural network of the form (Affine, ReLU, Affine). Our main finding is that the decision boundaries are a subset of a tropical hypersurface, which is intimately related to a polytope formed by the convex hull of two zonotopes. The generators of these zonotopes are functions of the neural network parameters. This geometric characterization provides new perspective to three tasks. Specifically, we propose a new tropical perspective to the lottery ticket hypothesis, where we see the effect of different initializations on the tropical geometric representation of a network's decision boundaries. Moreover, we use this characterization to propose a new set of tropical regularizers, which directly deal with the decision boundaries of a network. We investigate the use of these regularizers in neural network pruning (by removing network parameters that do not contribute to the tropical geometric representation of the decision boundaries) and in generating adversarial input attacks (by producing input perturbations that explicitly perturb the decision boundaries' geometry and ultimately change the network's prediction).

[82]  arXiv:2002.08849 (cross-list from q-fin.ST) [pdf, other]
Title: Forecasting Realized Volatility Matrix With Copula-Based Models
Comments: 26 pages, 3 figures
Subjects: Statistical Finance (q-fin.ST); Applications (stat.AP)

Multivariate volatility modeling and forecasting are crucial in financial economics. This paper develops a copula-based approach to model and forecast realized volatility matrices. The proposed copula-based time series models can capture the hidden dependence structure of realized volatility matrices. Also, this approach can automatically guarantee the positive definiteness of the forecasts through either Cholesky decomposition or matrix logarithm transformation. In this paper we consider both multivariate and bivariate copulas; the types of copulas include Student's t, Clayton and Gumbel copulas. In an empirical application, we find that for one-day ahead volatility matrix forecasting, these copula-based models can achieve significant performance both in terms of statistical precision and in creating economically mean-variance efficient portfolios. Among the copulas we considered, the multivariate-t copula performs better in statistical precision, while the bivariate-t copula has better economic performance.

[83]  arXiv:2002.08856 (cross-list from math.OC) [pdf, ps, other]
Title: Bounding the expected run-time of nonconvex optimization with early stopping
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

This work examines the convergence of stochastic gradient-based optimization algorithms that use early stopping based on a validation function. The form of early stopping we consider is that optimization terminates when the norm of the gradient of a validation function falls below a threshold. We derive conditions that guarantee this stopping rule is well-defined, and provide bounds on the expected number of iterations and gradient evaluations needed to meet this criterion. The guarantee accounts for the distance between the training and validation sets, measured with the Wasserstein distance. We develop the approach in the general setting of a first-order optimization algorithm, with possibly biased update directions subject to a geometric drift condition. We then derive bounds on the expected running time for early stopping variants of several algorithms, including stochastic gradient descent (SGD), decentralized SGD (DSGD), and the stochastic variance reduced gradient (SVRG) algorithm. Finally, we consider the generalization properties of the iterate returned by early stopping.
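
The stopping rule described above can be sketched in a few lines. The example below is only a toy illustration under stated assumptions, not the paper's setting or analysis: it runs plain SGD on a noise-free least-squares problem and stops once the norm of the gradient of a validation loss drops below a threshold. The names `grad` and `sgd_early_stopping`, the step size, the threshold, and the iteration cap are all hypothetical choices made for the example.

```python
import math
import random


def grad(w, data):
    """Gradient of the mean of 0.5 * (w . x - y)^2 over the given (x, y) pairs."""
    g = [0.0] * len(w)
    for x, y in data:
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += residual * xj / len(data)
    return g


def norm(v):
    """Euclidean norm of a vector given as a list."""
    return math.sqrt(sum(vi * vi for vi in v))


def sgd_early_stopping(train, val, w0, lr=0.05, threshold=1e-3,
                       max_iters=100_000, seed=0):
    """Run SGD on `train`; stop when the validation gradient norm < threshold.

    Returns (weights, iterations_used, final_validation_gradient_norm).
    """
    rng = random.Random(seed)
    w = list(w0)
    for t in range(max_iters):
        val_norm = norm(grad(w, val))
        if val_norm < threshold:  # early-stopping criterion
            return w, t, val_norm
        sample = [rng.choice(train)]  # single-sample stochastic gradient
        w = [wi - lr * gi for wi, gi in zip(w, grad(w, sample))]
    return w, max_iters, norm(grad(w, val))


# Toy data generated by the (assumed) true weights [2.0, -1.0], without noise.
train = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0),
         ([1.0, -1.0], 3.0), ([2.0, 1.0], 3.0)]
val = [([1.0, 2.0], 0.0), ([-1.0, 1.0], -3.0), ([0.5, 0.5], 0.5)]

w, iters, val_norm = sgd_early_stopping(train, val, w0=[0.0, 0.0])
```

Because the toy data are noise-free, the training and validation minimizers coincide and the rule is eventually triggered; with noisy or mismatched data the threshold may never be reached, which is why the sketch also caps the number of iterations.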

[84]  arXiv:2002.08859 (cross-list from cs.LG) [pdf, other]
Title: A Bayes-Optimal View on Adversarial Examples
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

The ability to fool modern CNN classifiers with tiny perturbations of the input has led to the development of a large number of candidate defenses and often conflicting explanations. In this paper, we argue for examining adversarial examples from the perspective of Bayes-Optimal classification. We construct realistic image datasets for which the Bayes-Optimal classifier can be efficiently computed and derive analytic conditions on the distributions so that the optimal classifier is either robust or vulnerable. By training different classifiers on these datasets (for which the "gold standard" optimal classifiers are known), we can disentangle the possible sources of vulnerability and avoid the accuracy-robustness tradeoff that may occur in commonly used datasets. Our results show that even when the optimal classifier is robust, standard CNN training consistently learns a vulnerable classifier. At the same time, for exactly the same training data, RBF SVMs consistently learn a robust classifier. The same trend is observed in experiments with real images.

[85]  arXiv:2002.08860 (cross-list from cs.LG) [pdf, other]
Title: Dissipative SymODEN: Encoding Hamiltonian Dynamics with Dissipation and Control into Deep Learning
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)

In this work, we introduce Dissipative SymODEN, a deep learning architecture which can infer the dynamics of a physical system with dissipation from observed state trajectories. To improve prediction accuracy while reducing network size, Dissipative SymODEN encodes the port-Hamiltonian dynamics with energy dissipation and external input into the design of its computation graph and learns the dynamics in a structured way. The learned model, by revealing key aspects of the system, such as the inertia, dissipation, and potential energy, paves the way for energy-based controllers.

[86]  arXiv:2002.08898 (cross-list from cs.CL) [pdf, other]
Title: MA-DST: Multi-Attention Based Scalable Dialog State Tracking
Comments: Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Task oriented dialog agents provide a natural language interface for users to complete their goal. Dialog State Tracking (DST), which is often a core component of these systems, tracks the system's understanding of the user's goal throughout the conversation. To enable accurate multi-domain DST, the model needs to encode dependencies between past utterances and slot semantics and understand the dialog context, including long-range cross-domain references. We introduce a novel architecture for this task to encode the conversation history and slot semantics more robustly by using attention mechanisms at multiple granularities. In particular, we use cross-attention to model relationships between the context and slots at different semantic levels and self-attention to resolve cross-domain coreferences. In addition, our proposed architecture does not rely on knowing the domain ontologies beforehand and can also be used in a zero-shot setting for new domains or unseen slot values. Our model improves the joint goal accuracy by 5% (absolute) in the full-data setting and by up to 2% (absolute) in the zero-shot setting over the present state-of-the-art on the MultiWoZ 2.1 dataset.

[87]  arXiv:2002.08902 (cross-list from cs.CL) [pdf, other]
Title: Application of Pre-training Models in Named Entity Recognition
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task to extract entities from unstructured data. The previous methods for NER were based on machine learning or deep learning. Recently, pre-training models have significantly improved performance on multiple NLP tasks. In this paper, firstly, we introduce the architecture and pre-training tasks of four common pre-training models: BERT, ERNIE, ERNIE2.0-tiny, and RoBERTa. Then, we apply these pre-training models to a NER task by fine-tuning, and compare the effects of the different model architecture and pre-training tasks on the NER task. The experiment results showed that RoBERTa achieved state-of-the-art results on the MSRA-2006 dataset.

[88]  arXiv:2002.08907 (cross-list from math.OC) [pdf, other]
Title: Second-order Conditional Gradients
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Constrained second-order convex optimization algorithms are the method of choice when a high accuracy solution to a problem is needed, due to the quadratic convergence rates these methods enjoy when close to the optimum. These algorithms require the solution of a constrained quadratic subproblem at every iteration. In the case where the feasible region can only be accessed efficiently through a linear optimization oracle, and computing first-order information about the function, although possible, is costly, the coupling of constrained second-order and conditional gradient algorithms leads to competitive algorithms with solid theoretical guarantees and good numerical performance.

[89]  arXiv:2002.08910 (cross-list from cs.CL) [pdf, other]
Title: How Much Knowledge Can You Pack Into the Parameters of a Language Model?
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any external context or knowledge. We show that this approach scales surprisingly well with model size and outperforms models that explicitly look up knowledge on the open-domain variants of Natural Questions and WebQuestions.

[90]  arXiv:2002.08927 (cross-list from cs.LG) [pdf, other]
Title: Regularized Autoencoders via Relaxed Injective Probability Flow
Comments: AISTATS 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Invertible flow-based generative models are an effective method for learning to generate samples, while allowing for tractable likelihood computation and inference. However, the invertibility requirement restricts models to have the same latent dimensionality as the inputs. This imposes significant architectural, memory, and computational costs, making them more challenging to scale than other classes of generative models such as Variational Autoencoders (VAEs). We propose a generative model based on probability flows that does away with the bijectivity requirement on the model and only assumes injectivity. This also provides another perspective on regularized autoencoders (RAEs), with our final objectives resembling RAEs with specific regularizers that are derived by lower bounding the probability flow objective. We empirically demonstrate the promise of the proposed model, improving over VAEs and AEs in terms of sample quality.

[91]  arXiv:2002.08930 (cross-list from cs.LG) [pdf, other]
Title: Multi-step Online Unsupervised Domain Adaptation
Comments: To appear in ICASSP 2020. Copyright 2020 IEEE
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In this paper, we address the Online Unsupervised Domain Adaptation (OUDA) problem, where the target data are unlabelled and arriving sequentially. The traditional methods on the OUDA problem mainly focus on transforming each arriving target data to the source domain, and they do not sufficiently consider the temporal coherency and accumulative statistics among the arriving target data. We propose a multi-step framework for the OUDA problem, which institutes a novel method to compute the mean-target subspace inspired by the geometrical interpretation on the Euclidean space. This mean-target subspace contains accumulative temporal information among the arrived target data. Moreover, the transformation matrix computed from the mean-target subspace is applied to the next target data as a preprocessing step, aligning the target data closer to the source domain. Experiments on four datasets demonstrated the contribution of each step in our proposed multi-step OUDA framework and its performance over previous approaches.

[92]  arXiv:2002.08933 (cross-list from eess.AS) [pdf, other]
Title: Wavesplit: End-to-End Speech Separation by Speaker Clustering
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Machine Learning (stat.ML)

We introduce Wavesplit, an end-to-end speech separation system. From a single recording of mixed speech, the model infers and clusters representations of each speaker and then estimates each source signal conditioned on the inferred representations. The model is trained on the raw waveform to jointly perform the two tasks. Our model infers a set of speaker representations through clustering, which addresses the fundamental permutation problem of speech separation. Moreover, the sequence-wide speaker representations provide a more robust separation of long, challenging sequences, compared to previous approaches. We show that Wavesplit outperforms the previous state-of-the-art on clean mixtures of 2 or 3 speakers (WSJ0-2mix, WSJ0-3mix), as well as in noisy (WHAM!) and reverberated (WHAMR!) conditions. As an additional contribution, we further improve our model by introducing online data augmentation for separation.

[93]  arXiv:2002.08934 (cross-list from cs.LG) [pdf, ps, other]
Title: Online high rank matrix completion
Comments: The paper was published by the proceedings of IEEE CVPR 2019
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Recent advances in matrix completion enable data imputation in full-rank matrices by exploiting low dimensional (nonlinear) latent structure. In this paper, we develop a new model for high rank matrix completion (HRMC), together with batch and online methods to fit the model and out-of-sample extension to complete new data. The method works by (implicitly) mapping the data into a high dimensional polynomial feature space using the kernel trick; importantly, the data occupies a low dimensional subspace in this feature space, even when the original data matrix is of full-rank. We introduce an explicit parametrization of this low dimensional subspace, and an online fitting procedure, to reduce computational complexity compared to the state of the art. The online method can also handle streaming or sequential data and adapt to non-stationary latent structure. We provide guidance on the sampling rate required for these methods to succeed. Experimental results on synthetic data and motion capture data validate the performance of the proposed methods.

[94]  arXiv:2002.08936 (cross-list from cs.LG) [pdf, other]
Title: Meta-learning for mixed linear regression
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In modern supervised learning, there are a large number of tasks, but many of them are associated with only a small amount of labeled data. These include data from medical image processing and robotic interaction. Even though each individual task cannot be meaningfully trained in isolation, one seeks to meta-learn across the tasks from past experiences by exploiting some similarities. We study a fundamental question of interest: When can abundant tasks with small data compensate for lack of tasks with big data? We focus on a canonical scenario where each task is drawn from a mixture of $k$ linear regressions, and identify sufficient conditions for such a graceful exchange to hold: the total number of examples necessary with only small data tasks scales similarly as when big data tasks are available. To this end, we introduce a novel spectral approach and show that we can efficiently utilize small data tasks with the help of $\tilde\Omega(k^{3/2})$ medium data tasks each with $\tilde\Omega(k^{1/2})$ examples.

[95]  arXiv:2002.08937 (cross-list from cs.LG) [pdf, other]
Title: Nyström Subspace Learning for Large-scale SVMs
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

As an implementation of the Nystr\"{o}m method, Nystr\"{o}m computational regularization (NCR) imposed on kernel classification and kernel ridge regression has proven capable of achieving optimal bounds in the large-scale statistical learning setting, while enjoying much better time complexity. In this study, we propose a Nystr\"{o}m subspace learning (NSL) framework to reveal that all you need for employing the Nystr\"{o}m method, including NCR, upon any kernel SVM is to use the efficient off-the-shelf linear SVM solvers as a black box. Based on our analysis, the bounds developed for the Nystr\"{o}m method are linked to NSL, and the analytical difference between two distinct implementations of the Nystr\"{o}m method is clearly presented. Besides, NSL also leads to sharper theoretical results for the clustered Nystr\"{o}m method. Finally, both regression and classification tasks are performed to compare two implementations of the Nystr\"{o}m method.

[96]  arXiv:2002.08949 (cross-list from cs.LG) [pdf, other]
Title: Improving Sampling Accuracy of Stochastic Gradient MCMC Methods via Non-uniform Subsampling of Gradients
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Common Stochastic Gradient MCMC methods approximate gradients by stochastic ones via uniformly subsampled data points. We propose that a non-uniform subsampling can reduce the variance introduced by the stochastic approximation, hence making the sampling of a target distribution more accurate. An exponentially weighted stochastic gradient approach (EWSG) is developed for this objective by matching the transition kernels of SG-MCMC methods respectively based on stochastic and batch gradients. A demonstration of EWSG combined with second-order Langevin equation for sampling purposes is provided. In our method, non-uniform subsampling is done efficiently via a Metropolis-Hastings chain on the data index, which is coupled to the sampling algorithm. The fact that our method has reduced local variance with high probability is theoretically analyzed. A non-asymptotic global error analysis is also presented. Numerical experiments based on both synthetic and real world data sets are also provided to demonstrate the efficacy of the proposed approaches. While statistical accuracy has improved, the speed of convergence was empirically observed to be at least comparable to the uniform version.

Replacements for Fri, 21 Feb 20

[97]  arXiv:1406.5958 (replaced) [pdf, other]
Title: Prior sample size extensions for assessing prior informativeness and prior--likelihood discordance
Subjects: Methodology (stat.ME)
[98]  arXiv:1709.05545 (replaced) [pdf, other]
Title: Generating Compact Tree Ensembles via Annealing
Comments: Comparison with Random Forest included in the results section
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[99]  arXiv:1802.02212 (replaced) [pdf, other]
Title: Classification and Disease Localization in Histopathology Using Only Global Labels: A Weakly-Supervised Approach
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[100]  arXiv:1804.05464 (replaced) [pdf, other]
Title: On Gradient-Based Learning in Continuous Games
Journal-ref: SIAM Journal on Mathematics of Data Science 2020 2:1, 103-131
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[101]  arXiv:1807.10801 (replaced) [pdf, other]
Title: On the expected runtime of multiple testing algorithms with bounded error
Authors: Georg Hahn
Subjects: Statistics Theory (math.ST)
[102]  arXiv:1809.02963 (replaced) [pdf, ps, other]
Title: Variational Approximation Error in Bayesian Non-negative Matrix Factorization
Authors: Naoki Hayashi
Comments: 21 pages. 1 table. Revision in Neural Networks
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
[103]  arXiv:1809.06092 (replaced) [pdf, other]
Title: Testing relevant hypotheses in functional time series via self-normalization
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
[104]  arXiv:1811.00353 (replaced) [pdf, ps, other]
Title: Hanson-Wright inequality in Banach spaces
Comments: MSC classification and acknowledgement added, minor typo corrected, references updated
Subjects: Probability (math.PR); Functional Analysis (math.FA); Statistics Theory (math.ST)
[105]  arXiv:1812.06575 (replaced) [pdf, other]
Title: Matching on Generalized Propensity Scores with Continuous Exposures
Comments: We create an R package, GPSmacthing, available at this https URL, to implement the proposed matching approach
Subjects: Methodology (stat.ME); Applications (stat.AP)
[106]  arXiv:1812.07944 (replaced) [pdf, ps, other]
Title: Estimation and Inference in the Presence of Fractional d=1/2 and Weakly Nonstationary Processes
Subjects: Statistics Theory (math.ST)
[107]  arXiv:1901.05947 (replaced) [pdf, other]
Title: Stochastic Gradient Descent on a Tree: an Adaptive and Robust Approach to Stochastic Convex Optimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
[108]  arXiv:1901.08560 (replaced) [pdf, other]
Title: Semi-Unsupervised Learning: Clustering and Classifying using Ultra-Sparse Labels
Comments: 8 pages, plus appendix
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[109]  arXiv:1902.03453 (replaced) [pdf]
Title: Distance metric learning based on structural neighborhoods for dimensionality reduction and classification performance improvement
Comments: 30 pages, 5 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[110]  arXiv:1902.11038 (replaced) [pdf, other]
Title: Multi-Stage Self-Supervised Learning for Graph Convolutional Networks on Graphs with Few Labels
Comments: AAAI Conference on Artificial Intelligence (AAAI 2020)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[111]  arXiv:1902.11045 (replaced) [pdf, other]
Title: Virtual Adversarial Training on Graph Convolutional Networks in Node Classification
Comments: Chinese Conference on Pattern Recognition and Computer Vision(PRCV) 2019 Oral paper
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[112]  arXiv:1903.00374 (replaced) [pdf, other]
Title: Model-Based Reinforcement Learning for Atari
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[113]  arXiv:1903.05315 (replaced) [pdf, ps, other]
Title: Optimality of Maximum Likelihood for Log-Concave Density Estimation and Bounded Convex Regression
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG)
[114]  arXiv:1903.09231 (replaced) [pdf, ps, other]
Title: Recovering the Lowest Layer of Deep Networks with High Threshold Activations
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[115]  arXiv:1903.09321 (replaced) [pdf, other]
Title: WONDER: Weighted one-shot distributed ridge regression in high dimensions
Comments: Gave the name "Wonder" to the algorithm, updated title, added algorithm for general non-isotropic design
Subjects: Statistics Theory (math.ST); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Computation (stat.CO)
[116]  arXiv:1903.10646 (replaced) [pdf, other]
Title: Increasing Iterate Averaging for Solving Saddle-Point Problems
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC); Machine Learning (stat.ML)
[117]  arXiv:1904.06145 (replaced) [pdf, other]
Title: Towards Photographic Image Manipulation with Balanced Growing of Generative Autoencoders
Comments: WACV 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[118]  arXiv:1905.10626 (replaced) [pdf, other]
Title: Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness
Comments: ICLR 2020
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
[119]  arXiv:1905.11379 (replaced) [pdf, ps, other]
Title: A New Non-Linear Conjugate Gradient Algorithm for Destructive Cure Rate Model and a Simulation Study: Illustration with Negative Binomial Competing Risks
Authors: Suvra Pal, Souvik Roy
Comments: arXiv admin note: text overlap with arXiv:1905.05963
Subjects: Statistics Theory (math.ST); Optimization and Control (math.OC)
[120]  arXiv:1905.12121 (replaced) [pdf, other]
Title: An Investigation of Data Poisoning Defenses for Online Learning
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
[121]  arXiv:1905.13002 (replaced) [pdf, other]
Title: Temporal Parallelization of Bayesian Smoothers
Subjects: Computation (stat.CO); Distributed, Parallel, and Cluster Computing (cs.DC); Dynamical Systems (math.DS)
[122]  arXiv:1906.02425 (replaced) [pdf, other]
Title: Uncertainty-guided Continual Learning with Bayesian Neural Networks
Comments: Accepted at ICLR 2020
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[123]  arXiv:1906.02922 (replaced) [pdf, other]
Title: Parameter-Free Learning for Evolving Markov Decision Processes: The Blessing of (More) Optimism
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[124]  arXiv:1906.04659 (replaced) [pdf, other]
Title: Stable Rank Normalization for Improved Generalization in Neural Networks and GANs
Comments: Accepted at the International Conference on Learning Representations, 2020, Addis Ababa, Ethiopia
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[125]  arXiv:1906.05467 (replaced) [pdf, other]
Title: Interpretable Generative Neural Spatio-Temporal Point Processes
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
[126]  arXiv:1907.04155 (replaced) [pdf, other]
Title: GP-VAE: Deep Probabilistic Time Series Imputation
Comments: Accepted for publication at the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[127]  arXiv:1909.10670 (replaced) [pdf, other]
Title: Subsampling Generative Adversarial Networks: Density Ratio Estimation in Feature Space with Softplus Loss
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[128]  arXiv:1909.11515 (replaced) [pdf, other]
Title: Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks
Comments: ICLR 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[129]  arXiv:1909.12077 (replaced) [pdf, other]
Title: Symplectic ODE-Net: Learning Hamiltonian Dynamics with Control
Journal-ref: International Conference on Learning Representations (ICLR 2020); https://openreview.net/forum?id=ryxmb1rKDS
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
[130]  arXiv:1909.13788 (replaced) [pdf, other]
Title: Revisiting Self-Training for Neural Sequence Generation
Comments: ICLR 2020. The first two authors contributed equally
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
[131]  arXiv:1909.13833 (replaced) [pdf, other]
Title: Relaxing Bijectivity Constraints with Continuously Indexed Normalising Flows
Comments: This is a major revision of our previous paper "Localised Generative Flows". We have significantly extended our theoretical justification, and have obtained experimental results on a wider range of baselines
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[132]  arXiv:1910.00643 (replaced) [pdf, other]
Title: SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum
Comments: Accepted to ICLR 2020
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
[133]  arXiv:1910.03175 (replaced) [pdf, other]
Title: MIM: Mutual Information Machine
Comments: Pre-print
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
[134]  arXiv:1910.03344 (replaced) [pdf, ps, other]
Title: The Universal Approximation Property: Characterizations, Existence, and a Canonical Topology for Deep-Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS)
[135]  arXiv:1910.03561 (replaced) [pdf, other]
Title: Deep Network Classification by Scattering and Homotopy Dictionary Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[136]  arXiv:1910.04938 (replaced) [pdf, other]
Title: Regret Analysis of Causal Bandit Problems
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[137]  arXiv:1910.05270 (replaced) [pdf, ps, other]
Title: Fast and Bayes-consistent nearest neighbors
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
[138]  arXiv:1910.05725 (replaced) [pdf, other]
Title: If dropout limits trainable depth, does critical initialisation still matter? A large-scale statistical analysis on ReLU networks
Comments: 8 pages, 6 figures, under consideration at Pattern Recognition Letters
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[139]  arXiv:1910.05769 (replaced) [pdf, ps, other]
Title: Large Deviation Analysis of Function Sensitivity in Random Deep Neural Networks
Authors: Bo Li, David Saad
Journal-ref: J. Phys. A: Math. Theor. 53. 104002 (2020)
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Machine Learning (stat.ML)
[140]  arXiv:1910.08371 (replaced) [pdf, other]
Title: Graph Convolutional Policy for Solving Tree Decomposition via Reinforcement Learning Heuristics
Comments: 8 pages, 7 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[141]  arXiv:1910.11831 (replaced) [pdf, other]
Title: Stabilizing DARTS with Amended Gradient Estimation on Architectural Parameters
Comments: 21 pages, 11 figures, submitted to ICML 2020, extensive results are added
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[142]  arXiv:1911.03432 (replaced) [pdf, other]
Title: Penalty Method for Inversion-Free Deep Bilevel Optimization
Comments: 17 Pages, 7 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
[143]  arXiv:1911.10633 (replaced) [pdf, other]
Title: The harmonic mean $\chi^2$ test to substantiate scientific findings
Authors: Leonhard Held
Comments: Revised version
Subjects: Methodology (stat.ME)
[144]  arXiv:1912.01599 (replaced) [pdf, ps, other]
Title: Stationary Points of Shallow Neural Networks with Quadratic Activation Function
Comments: 30 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
[145]  arXiv:1912.02290 (replaced) [pdf, other]
Title: Hierarchical Indian Buffet Neural Networks for Bayesian Continual Learning
Comments: Full preprint
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[146]  arXiv:1912.03703 (replaced) [pdf, other]
Title: $\mathtt{MedGraph:}$ Structural and Temporal Representation Learning of Electronic Medical Records
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[147]  arXiv:1912.04695 (replaced) [pdf, other]
Title: Transparent Classification with Multilayer Logical Perceptrons and Random Binarization
Comments: AAAI-20 (oral presentation); source codes added
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[148]  arXiv:1912.05541 (replaced) [pdf, other]
Title: Fundamental Entropic Laws and $\mathcal{L}_p$ Limitations of Feedback Systems: Implications for Machine-Learning-in-the-Loop Control
Comments: arXiv admin note: text overlap with arXiv:1912.02628
Subjects: Systems and Control (eess.SY); Information Theory (cs.IT); Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
[149]  arXiv:1912.05695 (replaced) [pdf, other]
Title: Randomized Exploration for Non-Stationary Stochastic Linear Bandits
Comments: The current version is bug-free after correction
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[150]  arXiv:1912.08335 (replaced) [pdf, other]
Title: Learning under Model Misspecification: Applications to Variational and Ensemble methods
Comments: Typos corrected. Section 3 partially revised. New section at the appendix
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
[151]  arXiv:2001.01385 (replaced) [pdf, other]
Title: Identifying and Compensating for Feature Deviation in Imbalanced Deep Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[152]  arXiv:2001.05494 (replaced) [pdf, other]
Title: Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders
Comments: Accepted for publication at the 24th European Conference on Artificial Intelligence (ECAI2020)
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Machine Learning (stat.ML)
[153]  arXiv:2001.07524 (replaced) [pdf, other]
Title: Node Masking: Making Graph Neural Networks Generalize and Scale Better
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[154]  arXiv:2002.01910 (replaced) [pdf, other]
Title: FastGAE: Fast, Scalable and Effective Graph Autoencoders with Stochastic Subgraph Decoding
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
[155]  arXiv:2002.03495 (replaced) [pdf, ps, other]
Title: A Diffusion Theory for Deep Learning Dynamics: Stochastic Gradient Descent Escapes From Sharp Minima Exponentially Fast
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[156]  arXiv:2002.03575 (replaced) [pdf, other]
Title: Bilinear Graph Neural Network with Node Interactions
Subjects: Machine Learning (cs.LG); Graphics (cs.GR); Machine Learning (stat.ML)
[157]  arXiv:2002.03864 (replaced) [pdf, other]
Title: Deep Graph Mapper: Seeing Graphs through the Neural Lens
Comments: 13 pages, 10 figures
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
[158]  arXiv:2002.04014 (replaced) [pdf, other]
Title: Statistically Efficient Off-Policy Policy Gradients
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
[159]  arXiv:2002.04108 (replaced) [pdf, other]
Title: Adversarial Filters of Dataset Biases
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
[160]  arXiv:2002.04320 (replaced) [pdf, other]
Title: Self-concordant analysis of Frank-Wolfe algorithms
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Computation (stat.CO)
[161]  arXiv:2002.05648 (replaced) [pdf, ps, other]
Title: Politics of Adversarial Machine Learning
Comments: Authors ordered alphabetically; 4 pages
Subjects: Computers and Society (cs.CY); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
[162]  arXiv:2002.06117 (replaced) [pdf, ps, other]
Title: Local continuity of log-concave projection, with applications to estimation under model misspecification
Subjects: Statistics Theory (math.ST)
[163]  arXiv:2002.06715 (replaced) [pdf, other]
Title: BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning
Journal-ref: Eighth International Conference on Learning Representations (ICLR 2020)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[164]  arXiv:2002.07916 (replaced) [pdf, other]
Title: Information Condensing Active Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)