Statistics
New submissions
New submissions for Fri, 21 Feb 20
 [1] arXiv:2002.08404 [pdf, other]

Title: Implicit Regularization of Random Feature Models
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Random Feature (RF) models are used as efficient parametric approximations of kernel methods. We investigate, by means of random matrix theory, the connection between Gaussian RF models and Kernel Ridge Regression (KRR). For a Gaussian RF model with $P$ features, $N$ data points, and a ridge $\lambda$, we show that the average (i.e. expected) RF predictor is close to a KRR predictor with an effective ridge $\tilde{\lambda}$. We show that $\tilde{\lambda} > \lambda$ and $\tilde{\lambda} \searrow \lambda$ monotonically as $P$ grows, thus revealing the implicit regularization effect of finite RF sampling. We then compare the risk (i.e. test error) of the $\tilde{\lambda}$-KRR predictor with the average risk of the $\lambda$-RF predictor and obtain a precise and explicit bound on their difference. Finally, we empirically find an extremely good agreement between the test errors of the average $\lambda$-RF predictor and the $\tilde{\lambda}$-KRR predictor.
 [2] arXiv:2002.08409 [pdf, other]

Title: On the geometric properties of finite mixture models
Subjects: Statistics Theory (math.ST)
In this paper we relate the geometry of extremal points to properties of mixtures of distributions. For a mixture model in $\mathbb{R}^J$ we consider as a prior the mixing density given by a uniform draw of $n$ points from the unit $(J-1)$-simplex, with $J \leq n$. We relate the extrema of these $n$ points to a mixture model with $m \leq n$ mixture components. We first show that the extrema of the points can recover any mixture density in the convex hull of the $n$ points via the Choquet measure. We then show that as the number of extremal points goes to infinity the convex hull converges to a smooth convex body. We also state a Central Limit Theorem for the number of extremal points. In addition, we state the convergence of the sequence of the empirical measures generated by our model to the Choquet measure. We relate our model to a classical nonparametric one based on a P\'olya tree. We close with an application of our model to population genomics.
 [3] arXiv:2002.08410 [pdf, other]

Title: A Unified Framework for Gaussian Mixture Reduction with Composite Transportation Distance
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Gaussian mixture reduction (GMR) is the problem of approximating a finite Gaussian mixture by one with fewer components. It is widely used in density estimation, nonparametric belief propagation, and Bayesian recursive filtering. Although optimization- and clustering-based algorithms have been proposed for GMR, they are either computationally expensive or lack theoretical support. In this work, we propose to perform GMR by minimizing the entropic regularized composite transportation distance between two mixtures. We show our approach provides a unified framework for GMR that is both interpretable and computationally efficient. Our work also bridges the gap between optimization- and clustering-based approaches for GMR. A Majorization-Minimization algorithm is developed for our optimization problem and its theoretical convergence is also established in this paper. Empirical experiments are also conducted to show the effectiveness of GMR. The effect of the choice of transportation cost on the performance of GMR is also investigated.
 [4] arXiv:2002.08412 [pdf, other]

Title: Weakly-supervised Multi-output Regression via Correlated Gaussian Processes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Multi-output regression seeks to infer multiple latent functions using data from multiple groups/sources while accounting for potential between-group similarities. In this paper, we consider multi-output regression under a weakly-supervised setting where a subset of data points from multiple groups are unlabeled. We use dependent Gaussian processes for multiple outputs constructed by convolutions with shared latent processes. We introduce hyperpriors for the multinomial probabilities of the unobserved labels and optimize the hyperparameters, which we show improves estimation. We derive two variational bounds: (i) a modified variational bound for fast and stable convergence in model inference, (ii) a scalable variational bound that is amenable to stochastic optimization. We use experiments on synthetic and real-world data to show that the proposed model outperforms state-of-the-art models with more accurate estimation of multiple latent functions and unobserved labels.
 [5] arXiv:2002.08422 [pdf, other]

Title: On conditional versus marginal bias in multi-armed bandits
Comments: 20 pages
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
The bias of the sample means of the arms in multi-armed bandits is an important issue in adaptive data analysis that has recently received considerable attention in the literature. Existing results relate in precise ways the sign and magnitude of the bias to various sources of data adaptivity, but do not apply to the conditional inference setting in which the sample means are computed only if some specific conditions are satisfied. In this paper, we characterize the sign of the conditional bias of monotone functions of the rewards, including the sample mean. Our results hold for arbitrary conditioning events and leverage natural monotonicity properties of the data collection policy. We further demonstrate, through several examples from sequential testing and best arm identification, that the sign of the conditional and unconditional bias of the sample mean of an arm can be different, depending on the conditioning event. Our analysis offers new and interesting perspectives on the subtleties of assessing the bias in data-adaptive settings.
 [6] arXiv:2002.08436 [pdf, other]

Title: Residual Bootstrap Exploration for Bandit Algorithms
Comments: The first two authors contributed equally
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
In this paper, we propose a novel perturbation-based exploration method in bandit algorithms with bounded or unbounded rewards, called residual bootstrap exploration (\texttt{ReBoot}). The \texttt{ReBoot} enforces exploration by injecting data-driven randomness through a residual-based perturbation mechanism. This novel mechanism captures the underlying distributional properties of fitting errors, and more importantly boosts exploration to escape from suboptimal solutions (for small sample sizes) by inflating variance level in an \textit{unconventional} way. In theory, with appropriate variance inflation level, \texttt{ReBoot} provably secures instance-dependent logarithmic regret in Gaussian multi-armed bandits. We evaluate the \texttt{ReBoot} in different synthetic multi-armed bandit problems and observe that the \texttt{ReBoot} performs better for unbounded rewards and more robustly than \texttt{Giro} \cite{kveton2018garbage} and \texttt{PHE} \cite{kveton2019perturbed}, with comparable computational efficiency to the Thompson sampling method.
 [7] arXiv:2002.08443 [pdf, other]

Title: Simultaneous Inference for Massive Data: Distributed Bootstrap
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
In this paper, we propose a bootstrap method applied to massive data processed distributedly in a large number of machines. This new method is computationally efficient in that we bootstrap on the master machine without the over-resampling typically required by existing methods \cite{kleiner2014scalable,sengupta2016subsampled}, while provably achieving optimal statistical efficiency with minimal communication. Our method does not require repeatedly refitting the model but only applies multiplier bootstrap in the master machine on the gradients received from the worker machines. Simulations validate our theory.
 [8] arXiv:2002.08457 [pdf, other]

Title: ivmodel: An R Package for Inference and Sensitivity Analysis of Instrumental Variables Models with One Endogenous Variable
Comments: 24 pages, 2 figures, 3 tables
Subjects: Applications (stat.AP)
We present a comprehensive R software package, ivmodel, for analyzing instrumental variables with one endogenous variable. The package implements a general class of estimators called k-class estimators and two confidence intervals that are fully robust to weak instruments. The package also provides power formulas for various test statistics in instrumental variables. Finally, the package contains methods for sensitivity analysis to examine the sensitivity of the inference to instrumental variables assumptions. We demonstrate the software on the data set from Card (1995), looking at the causal effect of levels of education on log earnings where the instrument is proximity to a four-year college.
 [9] arXiv:2002.08465 [pdf, other]

Title: Descriptive and Predictive Analysis of Euroleague Basketball Games and the Wisdom of Basketball Crowds
Authors: Georgios Giasemidis
Comments: 24 pages, several figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Other Statistics (stat.OT)
In this study we focus on the prediction of basketball games in the Euroleague competition using machine learning modelling. The prediction is a binary classification problem, predicting whether a match finishes 1 (home win) or 2 (away win). Data is collected from the Euroleague's official website for the seasons 2016-2017, 2017-2018 and 2018-2019, i.e. in the new format era. Features are extracted from matches' data and off-the-shelf supervised machine learning techniques are applied. We calibrate and validate our models. We find that simple machine learning models give accuracy not greater than 67% on the test set, worse than some sophisticated benchmark models. Additionally, the importance of this study lies in the "wisdom of the basketball crowd" and we demonstrate how the predicting power of a collective group of basketball enthusiasts can outperform machine learning models discussed in this study. We argue why the accuracy level of this group of "experts" should be set as the benchmark for future studies in the prediction of (European) basketball games using machine learning.
 [10] arXiv:2002.08476 [pdf, other]

Title: A non-inferiority test for R-squared with random regressors
Authors: Harlan Campbell
Comments: 14 pages, 2 figures
Subjects: Methodology (stat.ME)
Determining the lack of association between an outcome variable and a number of different explanatory variables is frequently necessary in order to disregard a proposed model. This paper proposes a non-inferiority test for the coefficient of determination (or squared multiple correlation coefficient), R-squared, in a linear regression analysis with random predictors. The test is derived from inverting a one-sided confidence interval based on a scaled central F distribution.
 [11] arXiv:2002.08505 [pdf, other]

Title: A Bayes Factor Approach with Informative Prior for Rare Genetic Variant Analysis from Next Generation Sequencing Data
Subjects: Applications (stat.AP)
The discovery of rare genetic variants through Next Generation Sequencing is a very challenging issue in the field of human genetics. We propose a novel region-based statistical approach based on a Bayes Factor (BF) to assess evidence of association between a set of rare variants (RVs) located on the same genomic region and a disease outcome in the context of case-control design. Marginal likelihoods are computed under the null and alternative hypotheses assuming a binomial distribution for the RV count in the region and a beta or mixture of Dirac and beta prior distribution for the probability of RV. We derive the theoretical null distribution of the BF under our prior setting and show that a Bayesian control of the False Discovery Rate (BFDR) can be obtained for genome-wide inference. Informative priors are introduced using prior evidence of association from a Kolmogorov-Smirnov test statistic. We use our simulation program, sim1000G, to generate RV data similar to the 1,000 genomes sequencing project. Our simulation studies showed that the new BF statistic outperforms standard methods (SKAT, SKAT-O, Burden test) in case-control studies with moderate sample sizes and is equivalent to them under large sample size scenarios. Our real data application to a lung cancer case-control study found enrichment for RVs in known and novel cancer genes. It also suggests that using the BF with informative prior improves the overall gene discovery compared to the BF with non-informative prior.
 [12] arXiv:2002.08506 [pdf, other]

Title: Causal Inference under Networked Interference
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Estimating individual treatment effects from data of randomized experiments is a critical task in causal inference. The Stable Unit Treatment Value Assumption (SUTVA) is usually made in causal inference. However, interference can introduce bias when the assigned treatment on one unit affects the potential outcomes of the neighboring units. This interference phenomenon is known as spillover effect in economics or peer effect in social science. Usually, in randomized experiments or observational studies with interconnected units, one can only observe treatment responses under interference. Hence, how to estimate the superimposed causal effect and recover the individual treatment effect in the presence of interference becomes a challenging task in causal inference. In this work, we study causal effect estimation under general network interference using GNNs, which are powerful tools for capturing the dependency in the graph. After deriving causal effect estimators, we further study intervention policy improvement on the graph under capacity constraint. We give policy regret bounds under network interference and treatment capacity constraint. Furthermore, a heuristic graph structure-dependent error bound for GNN-based causal estimators is provided.
 [13] arXiv:2002.08514 [pdf, ps, other]

Title: Queueing Subject To Action-Dependent Server Performance: Utilization Rate Reduction
Subjects: Applications (stat.AP); Optimization and Control (math.OC)
We consider a discrete-time system comprising a first-come-first-served queue, a non-preemptive server, and a stationary non-work-conserving scheduler. New tasks arrive at the queue according to a Bernoulli process. At each instant, the server is either busy working on a task or is available, in which case the scheduler either assigns a new task to the server or allows it to remain available (to rest). In addition to the aforementioned availability state, we assume that the server has an integer-valued activity state. The activity state is non-decreasing during work periods, and is non-increasing otherwise. In a typical application of our framework, the server performance (understood as task completion probability) worsens as the activity state increases. In this article, we expand on stabilizability results recently obtained for the same framework to establish methods to design scheduling policies that not only stabilize the queue but also reduce the utilization rate, which is understood as the infinite-horizon time-averaged expected portion of time the server is working. This article has a main theorem leading to two main results: (i) Given an arrival rate, we describe a tractable method, using a finite-dimensional linear program (LP), to compute the infimum of all utilization rates achievable by stabilizing scheduling policies. (ii) We propose a tractable method, also based on finite-dimensional LPs, to obtain stabilizing scheduling policies that are arbitrarily close to the aforementioned infimum. We also establish structural and distributional convergence properties, which are used throughout the article, and are significant in their own right.
 [14] arXiv:2002.08521 [pdf, other]

Title: Network Group Hawkes Process Model
Comments: 42 pages
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
In this work, we study the event occurrences of user activities on online social network platforms. To characterize the social activity interactions among network users, we propose a network group Hawkes (NGH) process model. Particularly, the observed network structure information is employed to model the users' dynamic posting behaviors. Furthermore, the users are clustered into latent groups according to their dynamic behavior patterns. To estimate the model, a constrained maximum likelihood approach is proposed. Theoretically, we establish the consistency and asymptotic normality of the estimators. In addition, we show that the group memberships can be identified consistently. To conduct estimation, a branching representation structure is first introduced, and a stochastic EM (StEM) algorithm is developed to tackle the computational problem. Lastly, we apply the proposed method to a social network dataset collected from Sina Weibo, and identify the influential network users as an interesting application.
 [15] arXiv:2002.08541 [pdf, other]

Title: A Scalable Framework for Sparse Clustering Without Shrinkage
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Clustering, a fundamental activity in unsupervised learning, is notoriously difficult when the feature space is high-dimensional. Fortunately, in many realistic scenarios, only a handful of features are relevant in distinguishing clusters. This has motivated the development of sparse clustering techniques that typically rely on k-means within outer algorithms of high computational complexity. Current techniques also require careful tuning of shrinkage parameters, further limiting their scalability. In this paper, we propose a novel framework for sparse k-means clustering that is intuitive, simple to implement, and competitive with state-of-the-art algorithms. We show that our algorithm enjoys consistency and convergence guarantees. Our core method readily generalizes to several task-specific algorithms such as clustering on subsets of attributes and in partially observed data settings. We showcase these contributions via simulated experiments and benchmark datasets, as well as a case study on mouse protein expression.
 [16] arXiv:2002.08542 [pdf, other]

Title: False Discovery Rate Control via Data Splitting
Comments: 33 pages, 10 figures
Subjects: Methodology (stat.ME)
Selecting relevant features associated with a given response variable is an important issue in many scientific fields. Quantifying quality and uncertainty of the selection via the false discovery rate (FDR) control has been of recent interest. This paper introduces a way of using data-splitting strategies to asymptotically control FDR for various feature selection techniques while maintaining high power. For each feature, the method estimates two independent significance coefficients via data splitting, and constructs a contrast statistic. The FDR control is achieved by taking advantage of the statistic's property that, for any null feature, its sampling distribution is symmetric about 0. We further propose a strategy to aggregate multiple data splits (MDS) to stabilize the selection result and boost the power. Interestingly, this multiple data-splitting approach appears capable of overcoming the power loss caused by data splitting with FDR still under control. The proposed framework is applicable to canonical statistical models including linear models, Gaussian graphical models, and deep neural networks. Simulation results, as well as a real data application, show that the proposed approaches, especially the multiple data-splitting strategy, control FDR well and are often more powerful than existing methods including the Benjamini-Hochberg procedure and the knockoff filter.
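The idea of a contrast statistic that is symmetric about 0 for null features can be sketched in a few lines. The specific forms below are assumptions for illustration and are not confirmed by the abstract: the statistic is taken to be sign(a*b) * (|a| + |b|) for the two half-sample coefficients a and b, and the cutoff is the smallest t at which the count of statistics at or below -t, divided by the count at or above t, does not exceed the target level q. The function names `mirror_statistics` and `select_features` are hypothetical.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def mirror_statistics(coef_a, coef_b):
    """Combine two independent coefficient estimates per feature.

    Assumed form: large and positive when both halves agree in sign and
    are large in magnitude; symmetric about 0 when the feature is null.
    """
    return [sign(a * b) * (abs(a) + abs(b)) for a, b in zip(coef_a, coef_b)]


def select_features(stats, q):
    """Pick features whose statistic exceeds a data-driven cutoff.

    The left tail (values <= -t) is used as an estimate of how many
    null features land in the right tail (values >= t).
    """
    candidates = sorted({abs(m) for m in stats if m != 0})
    for t in candidates:
        n_neg = sum(1 for m in stats if m <= -t)
        n_pos = sum(1 for m in stats if m >= t)
        if n_neg / max(n_pos, 1) <= q:
            return [j for j, m in enumerate(stats) if m >= t]
    return []  # no cutoff meets the target level
```

In use, `coef_a` and `coef_b` would come from fitting the same model on two disjoint halves of the data; any estimator could be substituted, which is why the sketch takes the coefficients as inputs rather than fitting a model itself.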
 [17] arXiv:2002.08543 [pdf, ps, other]

Title: Derivation of the Exact Moments of the Distribution of Pearson's Correlation over Permutations of Data
Comments: 8 pages
Subjects: Statistics Theory (math.ST)
Pearson's correlation is one of the most widely used measures of association today, the importance of which to modern science cannot be overstated. Two of the most common methods for computing the p-value for a hypothesis test of this correlation method are a t-statistic and permutation sampling. When a dataset comes from a bivariate normal distribution under specific data transformations a t-statistic is exact. However, for datasets which do not follow this stipulation, both approaches are merely estimations of the distribution of the correlation over permutations of data. In this paper we explicitly show the dependency of the permutation distribution of Pearson's correlation on the central moments of the data and derive an inductive formula which allows the computation of these exact moments. This has direct implications for computing the p-value for general datasets which could lead to more computationally accurate methods.
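As a small illustration of the permutation distribution discussed here, the sketch below enumerates every permutation of a tiny dataset by brute force. It is not the paper's inductive formula. It only checks the classical facts that, over all permutations, the correlation has mean 0 and variance 1/(n-1), and it computes an exact two-sided permutation p-value. The function names and any sample data passed to them are hypothetical, and full enumeration is only feasible for very small n.

```python
import itertools
import math


def pearson_r(x, y):
    """Pearson's correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_distribution(x, y):
    """Correlation of x with every reordering of y (n! values)."""
    return [pearson_r(x, perm) for perm in itertools.permutations(y)]


def permutation_moments(x, y):
    """Exact mean and variance of r over all permutations."""
    rs = permutation_distribution(x, y)
    mean = sum(rs) / len(rs)
    var = sum((r - mean) ** 2 for r in rs) / len(rs)
    return mean, var


def exact_permutation_pvalue(x, y):
    """Two-sided p-value: share of permutations with |r| >= observed |r|."""
    observed = abs(pearson_r(x, y))
    rs = permutation_distribution(x, y)
    # Small tolerance so the identity permutation always counts itself.
    hits = sum(1 for r in rs if abs(r) >= observed - 1e-12)
    return hits / len(rs)
```

For five observations this enumerates 120 permutations; beyond roughly ten observations the factorial growth makes sampling or moment-based approximations necessary, which is the gap the abstract addresses.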
 [18] arXiv:2002.08545 [pdf, other]

Title: Familywise Error Rate Control by Interactive Unmasking
Comments: 22 pages, 8 figures
Subjects: Methodology (stat.ME)
We propose a method for multiple hypothesis testing with familywise error rate (FWER) control, called the i-FWER test. Most testing methods are predefined algorithms that do not allow modifications after observing the data. However, in practice, analysts tend to choose a promising algorithm after observing the data; unfortunately, this violates the validity of the conclusion. The i-FWER test allows much flexibility: a human (or a computer program acting on the human's behalf) may adaptively guide the algorithm in a data-dependent manner. We prove that our test controls FWER if the analysts adhere to a particular protocol of "masking" and "unmasking". We demonstrate via numerical experiments the power of our test under structured non-nulls, and then explore new forms of masking.
 [19] arXiv:2002.08560 [pdf, other]

Title: Robust M-estimation for Partially Observed Functional Data
Comments: 38 pages, 5 figures
Subjects: Methodology (stat.ME)
Irregular functional data in which densely sampled curves are observed over different ranges pose a challenge for modeling and inference, and sensitivity to outlier curves is a concern in many applications. This paper investigates a class of robust M-estimators for partially observed functional data, modeling irregular structure using a missing data framework. We derive asymptotic normality of the functional M-estimator under the proposed framework and show root-$n$ rates of convergence. Furthermore, we propose a class of functional trend tests to find significant directions in the trend of location. For the implementation of the inferential test, we adopt a joint bootstrap approach. The performance is demonstrated in simulations and application to data from quantitative ultrasound analysis.
 [20] arXiv:2002.08563 [pdf, other]

Title: The continuous categorical: a novel simplex-valued exponential family
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Simplex-valued data appear throughout statistics and machine learning, for example in the context of transfer learning and compression of deep networks. Existing models for this class of data rely on the Dirichlet distribution or other related loss functions; here we show these standard choices suffer systematically from a number of limitations, including bias and numerical issues that frustrate the use of flexible network models upstream of these distributions. We resolve these limitations by introducing a novel exponential family of distributions for modeling simplex-valued data: the continuous categorical, which arises as a nontrivial multivariate generalization of the recently discovered continuous Bernoulli. Unlike the Dirichlet and other typical choices, the continuous categorical results in a well-behaved probabilistic loss function that produces unbiased estimators, while preserving the mathematical simplicity of the Dirichlet. As well as exploring its theoretical properties, we introduce sampling methods for this distribution that are amenable to the reparameterization trick, and evaluate their performance. Lastly, we demonstrate that the continuous categorical outperforms standard choices empirically, across a simulation study, an applied example on multi-party elections, and a neural network compression task.
 [21] arXiv:2002.08609 [pdf, other]

Title: A Bayesian Feature Allocation Model for Identification of Cell Subpopulations Using Cytometry Data
Subjects: Applications (stat.AP)
A Bayesian feature allocation model (FAM) is presented for identifying cell subpopulations based on multiple samples of cell surface or intracellular marker expression level data obtained by cytometry by time of flight (CyTOF). Cell subpopulations are characterized by differences in expression patterns of markers, and individual cells are clustered into the subpopulations based on the patterns of their observed expression levels. A finite Indian buffet process is used to model subpopulations as latent features, and a model-based method based on these latent feature subpopulations is used to construct cell clusters within each sample. Non-ignorable missing data due to technical artifacts in mass cytometry instruments are accounted for by defining a static missing data mechanism. In contrast to conventional cell clustering methods based on observed marker expression levels that are applied separately to different samples, the FAM-based method can be applied simultaneously to multiple samples, and can identify important cell subpopulations likely to be missed by conventional clustering. The proposed FAM-based method is applied to jointly analyze three datasets, generated by CyTOF, to study natural killer (NK) cells. Because the subpopulations identified by the FAM may define novel NK cell subsets, this statistical analysis may provide useful information about the biology of NK cells and their potential role in cancer immunotherapy which may lead, in turn, to development of improved cellular therapies. Simulation studies of the proposed method's behavior under two cases of known subpopulations also are presented, followed by analysis of the CyTOF NK cell surface marker data.
 [22] arXiv:2002.08663 [pdf, ps, other]

Title: Learning Gaussian Graphical Models via Multiplicative Weights
Comments: AISTATS 2020
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
Graphical model selection in Markov random fields is a fundamental problem in statistics and machine learning. Two particularly prominent models, the Ising model and Gaussian model, have largely developed in parallel using different (though often related) techniques, and several practical algorithms with rigorous sample complexity bounds have been established for each. In this paper, we adapt a recently proposed algorithm of Klivans and Meka (FOCS, 2017), based on the method of multiplicative weight updates, from the Ising model to the Gaussian model, via nontrivial modifications to both the algorithm and its analysis. The algorithm enjoys a sample complexity bound that is qualitatively similar to others in the literature, has a low runtime $O(mp^2)$ in the case of $m$ samples and $p$ nodes, and can trivially be implemented in an online manner.
 [23] arXiv:2002.08724 [pdf, other]

Title: Generalized sampling with functional principal components for high-resolution random field estimation
Authors: Milana Gataric
Subjects: Statistics Theory (math.ST); Signal Processing (eess.SP); Numerical Analysis (math.NA); Machine Learning (stat.ML)
In this paper, we take a statistical approach to the problem of recovering a function from low-resolution measurements taken with respect to an arbitrary basis, by regarding the function of interest as a realization of a random field. We introduce an infinite-dimensional framework for high-resolution estimation of a random field from its low-resolution indirect measurements as well as the high-resolution measurements of training observations by merging the existing frameworks of generalized sampling and functional principal component analysis. We study the statistical performance of the resulting estimation procedure and show that high-resolution recovery is indeed possible provided appropriate low-rank and angle conditions hold and provided the training set is sufficiently large relative to the desired resolution. We also consider sparse representations of the principal components, which can reduce the required size of the training set. Furthermore, the effectiveness of the proposed procedure is investigated in various numerical examples.
 [24] arXiv:2002.08731 [pdf, other]

Title: APTER: Aggregated Prognosis Through Exponential Reweighting
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
This paper considers the task of learning how to make a prognosis of a patient based on his/her microarray expression levels. The method is an application of the aggregation method as recently proposed in the literature on theoretical machine learning, and excels in its computational convenience and capability to deal with high-dimensional data. A formal analysis of the method is given, yielding rates of convergence similar to what traditional techniques obtain, while it is shown to cope well with an exponentially large set of features. Those results are supported by numerical simulations on a range of publicly available survival-microarray datasets. It is empirically found that the proposed technique combined with a recently proposed preprocessing technique gives excellent performance.
 [25] arXiv:2002.08757 [pdf, other]

Title: Asymptotically Optimal Bias Reduction for Parametric Models
Comments: arXiv admin note: substantial text overlap with arXiv:1907.11541
Subjects: Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
An important challenge in statistical analysis concerns the control of the finite-sample bias of estimators. This problem is magnified in high-dimensional settings where the number of variables $p$ diverges with the sample size $n$, as well as for nonlinear models and/or models with discrete data. For these complex settings, we propose to use a general simulation-based approach and show that the resulting estimator has a bias of order $\mathcal{O}(0)$, hence providing an asymptotically optimal bias reduction. It is based on an initial estimator that can be slightly asymptotically biased, making the approach very generally applicable. This is particularly relevant when classical estimators, such as the maximum likelihood estimator, can only be (numerically) approximated. We show that the iterative bootstrap of Kuk (1995) provides a computationally efficient approach to compute this bias-reduced estimator. We illustrate our theoretical results in simulation studies for which we develop new bias-reduced estimators for the logistic regression, with and without random effects. These estimators enjoy additional properties such as robustness to data contamination and to the problem of separability.
 [26] arXiv:2002.08774 [pdf, ps, other]

Title: Propose, Test, Release: Differentially private estimation with high probability
Comments: arXiv admin note: text overlap with arXiv:1906.11923
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Statistics Theory (math.ST)
We derive concentration inequalities for differentially private median and mean estimators building on the "Propose, Test, Release" (PTR) mechanism introduced by Dwork and Lei (2009). We introduce a new general version of the PTR mechanism that allows us to derive high probability error bounds for differentially private estimators. Our algorithms provide the first statistical guarantees for differentially private estimation of the median and mean without any boundedness assumptions on the data, and without assuming that the target population parameter lies in some known bounded interval. Our procedures do not rely on any truncation of the data and provide the first sub-Gaussian high probability bounds for differentially private median and mean estimation, for possibly heavy-tailed random variables.
 [27] arXiv:2002.08789 [pdf, other]

Title: Consistent model selection procedure for general integer-valued time series
Subjects: Statistics Theory (math.ST)
This paper deals with the problem of model selection for a general class of integer-valued time series. We propose a penalized criterion based on the Poisson quasi-likelihood of the model. Under certain regularity conditions, the consistency of the procedure as well as the consistency and the asymptotic normality of the Poisson quasi-likelihood estimator of the selected model are established. Simulation experiments are conducted for some classical models such as Poisson, binary INGARCH and negative binomial models with nonlinear dynamics. Also, an application to a real dataset is provided.
 [28] arXiv:2002.08797 [pdf, other]

Title: Pruning untrained neural networks: Principles and Analysis
Comments: 50 pages, 12 figures
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Overparameterized neural networks display state-of-the-art performance. However, there is a growing need for smaller, energy-efficient neural networks to be able to use machine learning applications on devices with limited computational resources. A popular approach consists of using pruning techniques. While these techniques have traditionally focused on pruning pre-trained neural networks (e.g. LeCun et al. (1990) and Hassibi et al. (1993)), recent work by Lee et al. (2018) showed promising results where pruning is performed at initialization. However, such procedures remain unsatisfactory as the resulting pruned networks can be difficult to train and, for instance, these procedures do not prevent one layer being fully pruned. In this paper we provide a comprehensive theoretical analysis of pruning at initialization and training sparse architectures. This analysis allows us to propose novel principled approaches which we validate experimentally on a variety of network architectures. We particularly show that we can prune up to 99.9% of the weights while keeping the model trainable.
 [29] arXiv:2002.08853 [pdf, other]

Title: A General Pairwise Comparison Model for Extremely Sparse Networks
Comments: 27 pages, 4 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Statistical inference using pairwise comparison data has been an effective approach to analyzing complex and sparse networks. In this paper we propose a general framework for modeling the mutual interaction in a probabilistic network, which enjoys ample flexibility in terms of parametrization. Within this setup, we establish that the maximum likelihood estimator (MLE) for the latent scores of the subjects is uniformly consistent under a near-minimal condition on network sparsity. This condition is sharp in terms of the leading order asymptotics describing the sparsity. The proof utilizes a novel chaining technique based on the error-induced metric as well as careful counting of comparison graph structures. Our results guarantee that the MLE is a valid estimator for inference in large-scale comparison networks where data is asymptotically deficient. Numerical simulations are provided to complement the theoretical analysis.
 [30] arXiv:2002.08871 [pdf, other]

Title: Fast Differentiable Sorting and Ranking
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The sorting operation is one of the most basic and commonly used building blocks in computer programming. In machine learning, it is commonly used for robust statistics. However, seen as a function, it is piecewise linear and as a result includes many kinks at which it is nondifferentiable. More problematic is the related ranking operator, commonly used for order statistics and ranking metrics. It is a piecewise constant function, meaning that its derivatives are null or undefined. While numerous works have proposed differentiable proxies to sorting and ranking, they do not achieve the $O(n \log n)$ time complexity one would expect from sorting and ranking operations. In this paper, we propose the first differentiable sorting and ranking operators with $O(n \log n)$ time and $O(n)$ space complexity. Our proposal in addition enjoys exact computation and differentiation. We achieve this feat by constructing differentiable sorting and ranking operators as projections onto the permutahedron, the convex hull of permutations, and using a reduction to isotonic optimization. Empirically, we confirm that our approach is an order of magnitude faster than existing approaches and showcase two novel applications: differentiable Spearman's rank correlation coefficient and soft least trimmed squares.
 [31] arXiv:2002.08943 [pdf, other]

Title: Implicit differentiation of Lasso-type models for hyperparameter optimization
Authors: Quentin Bertrand, Quentin Klopfenstein, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Setting regularization parameters for Lasso-type estimators is notoriously difficult, though crucial in practice. The most popular hyperparameter optimization approach is grid-search using held-out validation data. Grid-search, however, requires choosing a predefined grid for each parameter, which scales exponentially in the number of parameters. Another approach is to cast hyperparameter optimization as a bi-level optimization problem that one can solve by gradient descent. The key challenge for these methods is the estimation of the gradient with respect to the hyperparameters. Computing this gradient via forward or backward automatic differentiation is possible yet usually suffers from high memory consumption. Alternatively, implicit differentiation typically involves solving a linear system, which can be prohibitive and numerically unstable in high dimension. In addition, implicit differentiation usually assumes smooth loss functions, which is not the case for Lasso-type problems. This work introduces an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems. Our approach scales to high-dimensional data by leveraging the sparsity of the solutions. Experiments demonstrate that the proposed method outperforms a large number of standard methods to optimize the error on held-out data, or the Stein Unbiased Risk Estimator (SURE).
 [32] arXiv:2002.08948 [pdf, other]

Title: I-SPEC: An End-to-End Framework for Learning Transportable, Shift-Stable Models
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Shifts in environment between development and deployment cause classical supervised learning to produce models that fail to generalize well to new target distributions. Recently, many solutions which find invariant predictive distributions have been developed. Among these, graph-based approaches do not require data from the target environment and can capture more stable information than alternative methods which find stable feature sets. However, these approaches assume that the data generating process is known in the form of a full causal graph, which is generally not the case. In this paper, we propose I-SPEC, an end-to-end framework that addresses this shortcoming by using data to learn a partial ancestral graph (PAG). Using the PAG we develop an algorithm that determines an interventional distribution that is stable to the declared shifts; this subsumes existing approaches which find stable feature sets that are less accurate. We apply I-SPEC to a mortality prediction problem to show it can learn a model that is robust to shifts without needing upfront knowledge of the full causal DAG.
Cross-lists for Fri, 21 Feb 20
 [33] arXiv:2002.05160 (cross-list from cs.DS) [pdf, other]

Title: Optimal Multiple Stopping Rule for Warm-Starting Sequential Selection
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper we present the Warm-starting Dynamic Thresholding algorithm, developed using dynamic programming, for a variant of the standard online selection problem. The problem allows job positions to be either free or already occupied at the beginning of the process. Throughout the selection process, the decision maker interviews one after the other the new candidates and reveals a quality score for each of them. Based on that information, she can (re)assign each job at most once by taking immediate and irrevocable decisions. We relax the hard requirement of the class of dynamic programming algorithms to perfectly know the distribution from which the scores of candidates are drawn, by presenting extensions for the partial and no-information cases, in which the decision maker can learn the underlying score distribution sequentially while interviewing candidates.
 [34] arXiv:2002.08356 (cross-list from physics.med-ph) [pdf, other]

Title: Comparative Visual Analytics for Assessing Medical Records with Sequence Embedding
Authors: Rongchen Guo, Takanori Fujiwara, Yiran Li, Kelly M. Lima, Soman Sen, Nam K. Tran, Kwan-Liu Ma
Comments: This manuscript is currently under review
Subjects: Medical Physics (physics.med-ph); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Machine learning for data-driven diagnosis has been actively studied in medicine to provide better healthcare. Supporting analysis of a patient cohort similar to a patient under treatment is a key task for clinicians to make decisions with high confidence. However, such analysis is not straightforward due to the characteristics of medical records: high dimensionality, irregularity in time, and sparsity. To address this challenge, we introduce a method for similarity calculation of medical records. Our method employs event and sequence embeddings. While we use an autoencoder for the event embedding, we apply its variant with the self-attention mechanism for the sequence embedding. Moreover, in order to better handle the irregularity of data, we enhance the self-attention mechanism with consideration of different time intervals. We have developed a visual analytics system to support comparative studies of patient records. To make a comparison of sequences with different lengths easier, our system incorporates a sequence alignment method. Through its interactive interface, the user can quickly identify patients of interest and conveniently review both the temporal and multivariate aspects of the patient records. We demonstrate the effectiveness of our design and system with case studies using a real-world dataset from the neonatal intensive care unit of UC Davis.
 [35] arXiv:2002.08396 (cross-list from cs.LG) [pdf, other]

Title: Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning
Authors: Noah Y. Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Martin Riedmiller
Comments: To appear in ICLR 2020
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed dataset (batch) of environment interactions is available and no new experience can be acquired. This property makes these algorithms appealing for real-world problems such as robot control. In practice, however, standard off-policy algorithms fail in the batch setting for continuous control. In this paper, we propose a simple solution to this problem. It admits the use of data generated by arbitrary behavior policies and uses a learned prior, the advantage-weighted behavior model (ABM), to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task. Our method can be seen as an extension of recent work on batch-RL that enables stable learning from conflicting data-sources. We find improvements on competitive baselines in a variety of RL tasks, including standard continuous control benchmarks and multi-task learning for simulated and real-world robots.
 [36] arXiv:2002.08405 (cross-list from cs.LG) [pdf, other]

Title: Warm Starting Bandits with Side Information from Confounded Data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study a variant of the multi-armed bandit problem where side information in the form of bounds on the mean of each arm is provided. We describe how these bounds on the means can be used efficiently for warm starting bandits. Specifically, we propose the novel UCB-SI algorithm, and illustrate improvements in cumulative regret over the standard UCB algorithm, both theoretically and empirically, in the presence of non-trivial side information. As noted in (Zhang & Bareinboim, 2017), such information arises, for instance, when we have prior logged data on the arms, but this data has been collected under a policy whose choice of arms is based on latent variables to which access is no longer available. We further provide a novel approach for obtaining such bounds from prior partially confounded data under some mild assumptions. We validate our findings through semi-synthetic experiments on data derived from real datasets.
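One simple way to picture how known bounds on arm means could warm start a UCB-style rule is to cap each arm's optimistic index at its known upper bound. The sketch below is an illustration only: the abstract does not specify the algorithm, so the UCB1-style bonus, the clipping rule, and the function names `clipped_ucb_index` and `select_arm` are all assumptions and should not be read as the authors' method.

```python
import math


def clipped_ucb_index(mean_estimate, pulls, t, upper_bound):
    """UCB1-style index capped at a known upper bound on the arm's mean.

    Assumption: the side information is an upper bound on the true mean,
    so the optimistic index never needs to exceed it.
    """
    if pulls == 0:
        # An unpulled arm is only as promising as its known upper bound.
        return upper_bound
    bonus = math.sqrt(2.0 * math.log(t) / pulls)
    return min(upper_bound, mean_estimate + bonus)


def select_arm(means, pulls, t, upper_bounds):
    """Pick the arm with the largest clipped index (ties go to the lowest id)."""
    indices = [
        clipped_ucb_index(m, n, t, u)
        for m, n, u in zip(means, pulls, upper_bounds)
    ]
    return max(range(len(indices)), key=indices.__getitem__)


if __name__ == "__main__":
    # Arm 1 has been pulled once, so plain UCB1 would favour it strongly;
    # the upper bound of 0.3 rules it out as the best arm.
    print(select_arm([0.5, 0.4], [10, 1], t=11, upper_bounds=[1.0, 0.3]))
```

In this toy call the rarely pulled arm has a large exploration bonus, but its index is capped by the side information, which is the kind of saving in exploration that tight bounds might provide.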
 [37] arXiv:2002.08423 (cross-list from cs.LG) [pdf, other]

Title: PrivacyFL: A simulator for privacy-preserving and secure federated learning
Comments: 15 pages
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Federated learning is a technique that enables distributed clients to collaboratively learn a shared machine learning model while keeping their training data localized. This reduces data privacy risks; however, privacy concerns still exist since it is possible to leak information about the training dataset from the trained model's weights or parameters. Setting up a federated learning environment, especially with security and privacy guarantees, is a time-consuming process with numerous configurations and parameters that can be manipulated. In order to help clients ensure that collaboration is feasible and to check that it improves their model accuracy, a real-world simulator for privacy-preserving and secure federated learning is required.
In this paper, we introduce PrivacyFL, which is an extensible, easily configurable and scalable simulator for federated learning environments. Its key features include latency simulation, robustness to client departure, support for both centralized and decentralized learning, and configurable privacy and security mechanisms based on differential privacy and secure multi-party computation.
In this paper, we motivate our research, describe the architecture of the simulator and associated protocols, and discuss its evaluation in numerous scenarios that highlight its wide range of functionality and its advantages. Our paper addresses a significant real-world problem: checking the feasibility of participating in a federated learning environment under a variety of circumstances. It also has a strong practical impact because organizations such as hospitals, banks, and research institutes, which have large amounts of sensitive data and would like to collaborate, would greatly benefit from having a system that enables them to do so in a privacy-preserving and secure manner.
 [38] arXiv:2002.08456 (cross-list from cs.GT) [pdf, other]

Title: From Poincaré Recurrence to Convergence in Imperfect Information Games: Finding Equilibrium via Regularization
Authors: Julien Perolat, Remi Munos, Jean-Baptiste Lespiau, Shayegan Omidshafiei, Mark Rowland, Pedro Ortega, Neil Burch, Thomas Anthony, David Balduzzi, Bart De Vylder, Georgios Piliouras, Marc Lanctot, Karl Tuyls
Comments: 43 pages
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper we investigate the Follow the Regularized Leader dynamics in sequential imperfect information games (IIG). We generalize existing results of Poincar\'e recurrence from normal-form games to zero-sum two-player imperfect information games and other sequential game settings. We then investigate how adapting the reward (by adding a regularization term) of the game can give strong convergence guarantees in monotone games. We continue by showing how this reward adaptation technique can be leveraged to build algorithms that converge exactly to the Nash equilibrium. Finally, we show how these insights can be directly used to build state-of-the-art model-free algorithms for zero-sum two-player Imperfect Information Games (IIG).
 [39] arXiv:2002.08483 (cross-list from cs.LG) [pdf, other]

Title: Strength from Weakness: Fast Learning Using Weak Supervision
Comments: 21 pages, 8 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study generalization properties of weakly supervised learning. That is, learning where only a few "strong" labels (the actual target of our prediction) are present but many more "weak" labels are available. In particular, we show that having access to weak labels can significantly accelerate the learning rate for the strong task to the fast rate of $\mathcal{O}(\nicefrac1n)$, where $n$ denotes the number of strongly labeled data points. This acceleration can happen even if by itself the strongly labeled data admits only the slower $\mathcal{O}(\nicefrac{1}{\sqrt{n}})$ rate. The actual acceleration depends continuously on the number of weak labels available, and on the relation between the two tasks. Our theoretical results are reflected empirically across a range of tasks and illustrate how weak labels speed up learning on the strong task.
 [40] arXiv:2002.08484 (cross-list from cs.LG) [pdf, other]

Title: Estimating Training Data Influence by Tracking Gradient Descent
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce a method called TrackIn that computes the influence of a training example on a prediction made by the model, by tracking how the loss on the test point changes during the training process whenever the training example of interest was utilized. We provide a scalable implementation of TrackIn via a combination of a few key ideas: (a) a first-order approximation to the exact computation, (b) using random projections to speed up the computation of the first-order approximation for large models, (c) using saved checkpoints of standard training procedures, and (d) cherry-picking layers of a deep neural network. An experimental evaluation shows that TrackIn is more effective in identifying mislabelled training examples than other related methods such as influence functions and representer points. We also discuss insights from applying the method on vision, regression and natural language tasks.
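Ideas (a) and (c) above can be pictured with a small sketch. The abstract does not state a formula, so the scoring rule below (a learning-rate-weighted sum, over saved checkpoints, of the dot product between the training-example gradient and the test-example gradient) is an assumption about what a first-order, checkpoint-based score may look like rather than the authors' implementation. The squared-loss linear model and the names `grad_squared_loss` and `checkpoint_influence` are hypothetical.

```python
def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to the weights w."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def checkpoint_influence(checkpoints, learning_rates, train_example, test_example):
    """Sum over checkpoints of lr * <grad(train), grad(test)>.

    A positive score suggests that gradient steps on the training example
    tended to reduce the loss on the test example (to first order).
    """
    score = 0.0
    for w, lr in zip(checkpoints, learning_rates):
        g_train = grad_squared_loss(w, *train_example)
        g_test = grad_squared_loss(w, *test_example)
        score += lr * sum(a * b for a, b in zip(g_train, g_test))
    return score


if __name__ == "__main__":
    checkpoints = [[0.0, 0.0], [0.5, 0.0]]   # weights saved during training
    learning_rates = [0.1, 0.1]              # step size used near each checkpoint
    train = ([1.0, 0.0], 1.0)                # (features, label)
    similar_test = ([1.0, 0.0], 1.0)
    unrelated_test = ([0.0, 1.0], 1.0)
    print(checkpoint_influence(checkpoints, learning_rates, train, similar_test))
    print(checkpoint_influence(checkpoints, learning_rates, train, unrelated_test))
```

In this toy setup a test point that shares features with the training point receives a positive score, while one with orthogonal features receives zero, since the two gradients have no overlap.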
 [41] arXiv:2002.08491 (cross-list from math.NA) [pdf, other]

Title: Entrywise convergence of iterative methods for eigenproblems
Comments: 22 pages, 6 figures
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Several problems in machine learning, statistics, and other fields rely on computing eigenvectors. For large scale problems, the computation of these eigenvectors is typically performed via iterative schemes such as subspace iteration or Krylov methods. While there is classical and comprehensive analysis for subspace convergence guarantees with respect to the spectral norm, in many modern applications other notions of subspace distance are more appropriate. Recent theoretical work has focused on perturbations of subspaces measured in the $\ell_{2 \to \infty}$ norm, but does not consider the actual computation of eigenvectors. Here we address the convergence of subspace iteration when distances are measured in the $\ell_{2 \to \infty}$ norm and provide deterministic bounds. We complement our analysis with a practical stopping criterion and demonstrate its applicability via numerical experiments. Our results show that one can get comparable performance on downstream tasks while requiring fewer iterations, thereby saving substantial computational time.
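For reference, the $\ell_{2 \to \infty}$ norm of a matrix is its largest row-wise Euclidean norm, $\|A\|_{2 \to \infty} = \max_i \|A_{i,:}\|_2$, which is why it captures entrywise (row-level) rather than aggregate error. A minimal sketch follows; the function name is illustrative, the matrix is a plain list of rows, and this is only the norm itself, not the paper's stopping criterion.

```python
import math


def two_to_infinity_norm(matrix):
    """Return the largest Euclidean norm among the rows of `matrix`."""
    return max(math.sqrt(sum(v * v for v in row)) for row in matrix)


if __name__ == "__main__":
    # The first row has Euclidean norm 5, the second has norm 1.
    print(two_to_infinity_norm([[3.0, 4.0], [1.0, 0.0]]))
```

A subspace error that is small in this norm is small in every row, which matters when each row corresponds to an individual node or data point.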
 [42] arXiv:2002.08517 (cross-list from cs.LG) [pdf, other]

Title: Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks
Comments: 18 pages, 9 figures, 2 tables
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Analysing and computing with Gaussian processes arising from infinitely wide neural networks has recently seen a resurgence in popularity. Despite this, many explicit covariance functions of networks with activation functions used in modern networks remain unknown. Furthermore, while the kernels of deep networks can be computed iteratively, theoretical understanding of deep kernels is lacking, particularly with respect to fixed-point dynamics. Firstly, we derive the covariance functions of MLPs with exponential linear units and Gaussian error linear units and evaluate the performance of the limiting Gaussian processes on some benchmarks. Secondly, and more generally, we introduce a framework for analysing the fixed-point dynamics of iterated kernels corresponding to a broad range of activation functions. We find that unlike some previously studied neural network kernels, these new kernels exhibit non-trivial fixed-point dynamics which are mirrored in finite-width neural networks.
 [43] arXiv:2002.08526 (cross-list from cs.LG) [pdf, other]

Title: Scalable Constrained Bayesian Optimization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The global optimization of a high-dimensional black-box function under black-box constraints is a pervasive task in machine learning, control, and engineering. These problems are difficult since the feasible set is typically non-convex and hard to find, in addition to the curses of dimensionality and the heterogeneity of the underlying functions. In particular, these characteristics dramatically impact the performance of Bayesian optimization methods, which otherwise have become the de-facto standard for sample-efficient optimization in unconstrained settings. Due to the lack of sample-efficient methods, practitioners usually fall back to evolutionary strategies or heuristics. We propose the scalable constrained Bayesian optimization (SCBO) algorithm that addresses the above challenges by data-independent transformations of the functions and follows the recent theme of local Bayesian optimization. A comprehensive experimental evaluation demonstrates that SCBO achieves excellent results and outperforms the state-of-the-art methods.
 [44] arXiv:2002.08528 (cross-list from cs.LG) [pdf, other]

Title: Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We study distributed optimization algorithms for minimizing the average of \emph{heterogeneous} functions distributed across several machines with a focus on communication efficiency. In such settings, naively using the classical stochastic gradient descent (SGD) or its variants (e.g., SVRG) with a uniform sampling of machines typically yields poor performance. It often leads to a dependence of the convergence rate on the maximum Lipschitz constant of the gradients across the devices. In this paper, we propose a novel \emph{adaptive} sampling of machines specially catered to these settings. Our method relies on an adaptive estimate of local Lipschitz constants based on the information of past gradients. We show that the new sampling scheme improves the dependence of the convergence rate from the maximum Lipschitz constant to the \emph{average} Lipschitz constant across machines, thereby significantly accelerating the convergence. Our experiments demonstrate that our method indeed speeds up the convergence of the standard SVRG algorithm in heterogeneous environments.
 [45] arXiv:2002.08536 (cross-list from cs.LG) [pdf, other]

Title: Safe Counterfactual Reinforcement Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
We develop a method for predicting the performance of reinforcement learning and bandit algorithms, given historical data that may have been generated by a different algorithm. Our estimator has the property that its prediction converges in probability to the true performance of a counterfactual algorithm at the fast $\sqrt{N}$ rate, as the sample size $N$ increases. We also show a correct way to estimate the variance of our prediction, thus allowing the analyst to quantify the uncertainty in the prediction. These properties hold even when the analyst does not know which among a large number of potentially important state variables are really important. These theoretical guarantees make our estimator safe to use. We finally apply it to improve advertisement design by a major advertisement company. We find that our method produces smaller mean squared errors than state-of-the-art methods.
 [46] arXiv:2002.08537 (cross-list from math.OC) [pdf, other]

Title: Adaptive Temporal Difference Learning with Linear Function Approximation
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper revisits the celebrated temporal difference (TD) learning algorithm for the policy evaluation in reinforcement learning. Typically, the performance of the plain-vanilla TD algorithm is sensitive to the choice of stepsizes. Oftentimes, TD suffers from slow convergence. Motivated by the tight connection between the TD learning algorithm and the stochastic gradient methods, we develop the first adaptive variant of the TD learning algorithm with linear function approximation that we term AdaTD. In contrast to the original TD, AdaTD is robust or less sensitive to the choice of stepsizes. Analytically, we establish that to reach an $\epsilon$-accuracy, the number of iterations needed is $\tilde{O}(\epsilon^{-2}\ln^4\frac{1}{\epsilon}/\ln^4\frac{1}{\rho})$, where $\rho$ represents the speed at which the underlying Markov chain converges to the stationary distribution. This implies that the iteration complexity of AdaTD is no worse than that of TD in the worst case. Going beyond TD, we further develop an adaptive variant of TD($\lambda$), which is referred to as AdaTD($\lambda$). We evaluate the empirical performance of AdaTD and AdaTD($\lambda$) on several standard reinforcement learning tasks in OpenAI Gym on both linear and nonlinear function approximation, which demonstrate the effectiveness of our new approaches over existing ones.
 [47] arXiv:2002.08538 (cross-list from cs.LG) [pdf, other]

Title: Non-asymptotic and Accurate Learning of Nonlinear Dynamical Systems
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Applications (stat.AP); Machine Learning (stat.ML)
We consider the problem of learning stabilizable systems governed by the nonlinear state equation $h_{t+1}=\phi(h_t,u_t;\theta)+w_t$. Here $\theta$ is the unknown system dynamics, $h_t$ is the state, $u_t$ is the input and $w_t$ is the additive noise vector. We study gradient-based algorithms to learn the system dynamics $\theta$ from samples obtained from a single finite trajectory. If the system is run by a stabilizing input policy, we show that temporally-dependent samples can be approximated by i.i.d. samples via a truncation argument by using mixing-time arguments. We then develop new guarantees for the uniform convergence of the gradients of the empirical loss. Unlike existing work, our bounds are noise sensitive, which allows for learning ground-truth dynamics with high accuracy and small sample complexity. Together, our results facilitate efficient learning of the general nonlinear system under a stabilizing policy. We specialize our guarantees to entrywise nonlinear activations and verify our theory in various numerical experiments.
 [48] arXiv:2002.08567 (cross-list from cs.LG) [pdf, other]

Title: Multi-Agent Meta-Reinforcement Learning for Self-Powered and Sustainable Edge Computing Systems
Comments: Submitted to IEEE Transactions on Network and Service Management
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Signal Processing (eess.SP); Machine Learning (stat.ML)
The stringent requirements of mobile edge computing (MEC) applications and functions fathom the high capacity and dense deployment of MEC hosts to the upcoming wireless networks. However, operating such high capacity MEC hosts can significantly increase energy consumption. Thus, a BS unit can act as a self-powered BS. In this paper, an effective energy dispatch mechanism for self-powered wireless networks with edge computing capabilities is studied. First, a two-stage linear stochastic programming problem is formulated with the goal of minimizing the total energy consumption cost of the system while fulfilling the energy demand. Second, a semi-distributed data-driven solution is proposed by developing a novel multi-agent meta-reinforcement learning (MAMRL) framework to solve the formulated problem. In particular, each BS plays the role of a local agent that explores a Markovian behavior for both energy consumption and generation while each BS transfers time-varying features to a meta-agent. Sequentially, the meta-agent optimizes (i.e., exploits) the energy dispatch decision by accepting only the observations from each local agent with its own state information. Meanwhile, each BS agent estimates its own energy dispatch policy by applying the learned parameters from the meta-agent. Finally, the proposed MAMRL framework is benchmarked by analyzing deterministic, asymmetric, and stochastic environments in terms of non-renewable energy usage, energy cost, and accuracy. Experimental results show that the proposed MAMRL model can reduce non-renewable energy usage by up to 11% and the energy cost by 22.4% (with 95.8% prediction accuracy), compared to other baseline methods.
 [49] arXiv:2002.08570 (cross-list from cs.LG) [pdf, other]

Title: Input Perturbation: A New Paradigm between Central and Local Differential Privacy
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Traditionally, there are two models on differential privacy: the central model and the local model. The central model focuses on the machine learning model and the local model focuses on the training data. In this paper, we study the \textit{input perturbation} method in differentially private empirical risk minimization (DP-ERM), preserving privacy of the central model. By adding noise to the original training data and training with the `perturbed data', we achieve ($\epsilon$,$\delta$)-differential privacy on the final model, along with some kind of privacy on the original data. We observe that there is an interesting connection between the local model and the central model: the perturbation on the original data causes the perturbation on the gradient, and finally the model parameters. This observation means that our method builds a bridge between the local and central models, protecting the data, the gradient and the model simultaneously, which is superior to previous central methods. Detailed theoretical analysis and experiments show that our method achieves almost the same (or even better) performance as some of the best previous central methods with more protections on privacy, which is an attractive result. Moreover, we extend our method to a more general case: the loss function satisfies the Polyak-Lojasiewicz condition, which is more general than strong convexity, the constraint on the loss function in most previous work.
 [50] arXiv:2002.08578 (cross-list from cs.LG) [pdf, other]

Title: Differentially Private ERM Based on Data Perturbation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper, after observing that different training data instances affect the machine learning model to different extents, we attempt to improve the performance of differentially private empirical risk minimization (DP-ERM) from a new perspective. Specifically, we measure the contributions of various training data instances on the final machine learning model, and select some of them to add random noise. Considering that the key of our method is to measure each data instance separately, we propose a new `Data perturbation' based (DB) paradigm for DP-ERM: adding random noise to the original training data and achieving ($\epsilon,\delta$)-differential privacy on the final machine learning model, along with the preservation on the original data. By introducing the Influence Function (IF), we quantitatively measure the impact of the training data on the final model. Theoretical and experimental results show that our proposed DB-DP-ERM paradigm enhances the model performance significantly.
 [51] arXiv:2002.08583 (cross-list from cs.LG) [pdf, other]

Title: Regret Minimization in Stochastic Contextual Dueling Bandits
Comments: 28 pages, 11 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We consider the problem of stochastic $K$-armed dueling bandit in the contextual setting, where at each round the learner is presented with a context set of $K$ items, each represented by a $d$-dimensional feature vector, and the goal of the learner is to identify the best arm of each context set. However, unlike the classical contextual bandit setup, our framework only allows the learner to receive item feedback in terms of their (noisy) pairwise preferences, famously studied as dueling bandits, which is of practical interest in various online decision making scenarios, e.g. recommender systems, information retrieval, tournament ranking, where it is easier to elicit the relative strength of the items instead of their absolute scores. However, to the best of our knowledge this work is the first to consider the problem of regret minimization of contextual dueling bandits for potentially infinite decision spaces and gives provably optimal algorithms along with a matching lower bound analysis. We present two algorithms for the setup with respective regret guarantees $\tilde O(d\sqrt{T})$ and $\tilde O(\sqrt{dT \log K})$. Subsequently we also show that $\Omega(\sqrt {dT})$ is actually the fundamental performance limit for this problem, implying the optimality of our second algorithm. However the analysis of our first algorithm is comparatively simpler, and it is often shown to outperform the former empirically. Finally, we corroborate all the theoretical results with suitable experiments.
 [52] arXiv:2002.08595 (cross-list from cs.CV) [pdf, other]

Title: KaoKore: A Pre-modern Japanese Art Facial Expression Dataset
Authors: Yingtao Tian, Chikahiko Suzuki, Tarin Clanuwat, Mikel Bober-Irizar, Alex Lamb, Asanobu Kitamoto
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
From classifying handwritten digits to generating strings of text, the datasets which have received long-time focus from the machine learning community vary greatly in their subject matter. This has motivated a renewed interest in building datasets which are socially and culturally relevant, so that algorithmic research may have a more direct and immediate impact on society. One such area is in history and the humanities, where better and relevant machine learning models can accelerate research across various fields. To this end, newly released benchmarks and models have been proposed for transcribing historical Japanese cursive writing, yet for the field as a whole using machine learning for historical Japanese artworks still remains largely uncharted. To bridge this gap, in this work we propose a new dataset KaoKore which consists of faces extracted from pre-modern Japanese artwork. We demonstrate its value as both a dataset for image classification as well as a creative and artistic dataset, which we explore using generative models. Dataset available at https://github.com/roiscodh/kaokore
 [53] arXiv:2002.08596 (cross-list from cs.LG) [pdf]

Title: Interpretability of machine learning based prediction models in healthcare
Comments: 12 pages, 2 figures, submitted to Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
There is a need to ensure that machine learning models are interpretable. Higher interpretability of the model means easier comprehension and explanation of future predictions for end-users. Further, interpretable machine learning models allow healthcare experts to make reasonable and data-driven decisions to provide personalized decisions that can ultimately lead to higher quality of service in healthcare. Generally, we can classify interpretability approaches in two groups where the first focuses on personalized interpretation (local interpretability) while the second summarizes prediction models on a population level (global interpretability). Alternatively, we can group interpretability methods into model-specific techniques, which are designed to interpret predictions generated by a specific model, such as a neural network, and model-agnostic approaches, which provide easy-to-understand explanations of predictions made by any machine learning model. Here, we give an overview of interpretability approaches and provide examples of practical interpretability of machine learning in different areas of healthcare, including prediction of health-related outcomes, optimizing treatments or improving the efficiency of screening for specific conditions. Further, we outline future directions for interpretable machine learning and highlight the importance of developing algorithmic solutions that can enable machine-learning driven decision making in high-stakes healthcare problems.
 [54] arXiv:2002.08597 (cross-list from eess.SP) [pdf, ps, other]

Title: Kalman Filtering With Censored Measurements
Comments: 14 pages, 3 figures
Subjects: Signal Processing (eess.SP); Methodology (stat.ME)
This paper concerns Kalman filtering when the measurements of the process are censored. The censored measurements are addressed by the Tobit model of Type I and are one-dimensional with two censoring limits, while the (hidden) state vectors are multidimensional. For this model, Bayesian estimates for the state vectors are provided through a recursive algorithm of Kalman filtering type. Experiments are presented to illustrate the effectiveness and applicability of the algorithm. The experiments show that the proposed method outperforms other filtering methodologies in minimizing the computational cost as well as the overall Root Mean Square Error (RMSE) for synthetic and real data sets.
 [55] arXiv:2002.08599 (cross-list from cs.LG) [pdf, other]

Title: On Learning Sets of Symmetric Elements
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Learning from unordered sets is a fundamental learning setup, which is attracting increasing attention. Research in this area has focused on the case where elements of the set are represented by feature vectors, and far less emphasis has been given to the common case where set elements themselves adhere to certain symmetries. That case is relevant to numerous applications, from deblurring image bursts to multi-view 3D shape recognition and reconstruction.
In this paper, we present a principled approach to learning sets of general symmetric elements. We first characterize the space of linear layers that are equivariant both to element reordering and to the inherent symmetries of elements, like translation in the case of images. We further show that networks that are composed of these layers, called Deep Sets for Symmetric elements layers (DSS), are universal approximators of both invariant and equivariant functions. DSS layers are also straightforward to implement. Finally, we show that they improve over existing set-learning architectures in a series of experiments with images, graphs, and point-clouds.
 [56] arXiv:2002.08605 (cross-list from cs.LG) [pdf, other]

Title: Optimizing Black-box Metrics with Adaptive Surrogates
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We address the problem of training models with black-box and hard-to-optimize metrics by expressing the metric as a monotonic function of a small number of easy-to-optimize surrogates. We pose the training problem as an optimization over a relaxed surrogate space, which we solve by estimating local gradients for the metric and performing inexact convex projections. We analyze gradient estimates based on finite differences and local linear interpolations, and show convergence of our approach under smoothness assumptions with respect to the surrogates. Experimental results on classification and ranking problems verify the proposal performs on par with methods that know the mathematical formulation, and adds notable value when the form of the metric is unknown.
 [57] arXiv:2002.08616 (cross-list from cs.LG) [pdf, other]

Title: Diversity sampling is an implicit regularization for kernel methods
Comments: 27 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Kernel methods have achieved very good performance on large scale regression and classification problems, by using the Nystr\"om method and preconditioning techniques. The Nystr\"om approximation, based on a subset of landmarks, gives a low rank approximation of the kernel matrix, and is known to provide a form of implicit regularization. We further elaborate on the impact of sampling diverse landmarks for constructing the Nystr\"om approximation in supervised as well as unsupervised kernel methods. By using Determinantal Point Processes for sampling, we obtain additional theoretical results concerning the interplay between diversity and regularization. Empirically, we demonstrate the advantages of training kernel methods based on subsets made of diverse points. In particular, if the dataset has a dense bulk and a sparser tail, we show that Nystr\"om kernel regression with diverse landmarks increases the accuracy of the regression in sparser regions of the dataset, with respect to a uniform landmark sampling. A greedy heuristic is also proposed to select diverse samples of significant size within large datasets when exact DPP sampling is not practically feasible.
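As a rough illustration of the Nystr\"om approximation mentioned in the abstract above, the sketch below builds the standard low-rank form $K \approx K_{nm} K_{mm}^{-1} K_{mn}$ from a set of landmarks. The Gaussian kernel, the one-dimensional toy data, and the hand-picked landmark sets are all assumptions for illustration; in particular the landmarks are chosen by hand, not by DPP sampling or by the paper's greedy heuristic.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-gamma * (x - y) ** 2)


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def nystrom(points, landmarks, gamma=1.0):
    """Return the Nystrom approximation K_nm K_mm^{-1} K_mn of the kernel matrix."""
    K_nm = [[rbf(x, z, gamma) for z in landmarks] for x in points]
    K_mm = [[rbf(z, w, gamma) for w in landmarks] for z in landmarks]
    K_mn = [list(col) for col in zip(*K_nm)]
    W = solve(K_mm, K_mn)  # K_mm^{-1} K_mn, shape m x n
    m, n = len(landmarks), len(points)
    return [[sum(K_nm[i][k] * W[k][j] for k in range(m)) for j in range(n)]
            for i in range(n)]


# Toy data with a dense bulk near 0 and a single point in a sparse tail at 3.
points = [0.0, 0.1, 0.2, 3.0]
bulk_only = nystrom(points, landmarks=[0.0, 0.1])   # both landmarks in the bulk
diverse = nystrom(points, landmarks=[0.0, 3.0])     # one landmark in the tail

# The exact kernel value k(3, 3) is 1; compare how well each choice recovers it.
err_bulk = abs(bulk_only[3][3] - 1.0)
err_diverse = abs(diverse[3][3] - 1.0)
print(err_bulk, err_diverse)
```

In this toy case the landmark set that reaches into the tail reconstructs the tail entry far better than the one confined to the bulk, which is in the spirit of the abstract's claim about sparser regions, though it is only a hand-built example.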
 [58] arXiv:2002.08619 (cross-list from cs.LG) [pdf, other]

Title: Boosting Adversarial Training with Hypersphere Embedding
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Adversarial training (AT) is one of the most effective defenses to improve the adversarial robustness of deep learning models. In order to promote the reliability of the adversarially trained models, we propose to boost AT via incorporating hypersphere embedding (HE), which can regularize the adversarial features onto compact hypersphere manifolds. We formally demonstrate that AT and HE are well coupled, which tunes up the learning dynamics of AT from several aspects. We comprehensively validate the effectiveness and universality of HE by embedding it into the popular AT frameworks including PGD-AT, ALP, and TRADES, as well as the FreeAT and FastAT strategies. In experiments, we evaluate our methods on the CIFAR-10 and ImageNet datasets, and verify that integrating HE can consistently enhance the performance of the models trained by each AT framework with little extra computation.
 [59] arXiv:2002.08621 (cross-list from cs.LG) [pdf, other]

Title: The Benefits of Pairwise Discriminators for Adversarial Training
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Adversarial training methods typically align distributions by solving two-player games. However, in most current formulations, even if the generator aligns perfectly with data, a sub-optimal discriminator can still drive the two apart. Absent additional regularization, the instability can manifest itself as a never-ending game. In this paper, we introduce a family of objectives by leveraging pairwise discriminators, and show that only the generator needs to converge. The alignment, if achieved, would be preserved with any discriminator. We provide sufficient conditions for local convergence; characterize the capacity balance that should guide the discriminator and generator choices; and construct examples of minimally sufficient discriminators. Empirically, we illustrate the theory and the effectiveness of our approach on synthetic examples. Moreover, we show that practical methods derived from our approach can better generate higher-resolution images.
 [60] arXiv:2002.08641 (cross-list from cs.LG) [pdf]

Title: A Novel Framework for Selection of GANs for an Application
Comments: 23 pages, 1 figures, 7 tables
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Generative Adversarial Network (GAN) is a current focal point of research. The body of knowledge is fragmented, leading to a trial-error method while selecting an appropriate GAN for a given scenario. We provide a comprehensive summary of the evolution of GANs starting from their inception, addressing issues like mode collapse, vanishing gradient, unstable training and non-convergence. We also provide a comparison of various GANs from the application point of view, their behaviour and implementation details. We propose a novel framework to identify candidate GANs for a specific use case based on architecture, loss, regularization and divergence. We also discuss application of the framework using an example, and we demonstrate a significant reduction in search space. This efficient way to determine potential GANs lowers unit economics of AI development for organizations.
 [61] arXiv:2002.08643 (cross-list from cs.LG) [pdf, other]

Title: Embedding Graph Auto-Encoder with Joint Clustering via Adjacency Sharing
Comments: 11 pages containing appendix
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Graph convolution networks have attracted much attention and several graph auto-encoder based clustering models have been developed for attributed graph clustering. However, most existing approaches separate clustering and optimization of the graph auto-encoder into two individual steps. In this paper, we propose a graph convolution network based clustering model, namely, Embedding Graph Auto-Encoder with JOint Clustering via Adjacency Sharing (\textit{EGAE-JOCAS}). As for the embedded model, we develop a novel joint clustering method, which combines relaxed k-means and spectral clustering and is applicable for the learned embedding. The proposed joint clustering shares the same adjacency within graph convolution layers. Two parts are optimized simultaneously through performing SGD and taking closed-form solutions alternately to ensure a rapid convergence. Moreover, our model is free to incorporate any mechanisms (e.g., attention) into the graph auto-encoder. Extensive experiments are conducted to prove the superiority of EGAE-JOCAS. Sufficient theoretical analyses are provided to support the results.
 [62] arXiv:2002.08645 (cross-list from cs.LG) [pdf, other]

Title: Uncovering Coresets for Classification With Multi-Objective Evolutionary Algorithms
Comments: 9 pages, 3 figures, conference. Submitted to ICML 2020
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
A coreset is a subset of the training set, using which a machine learning algorithm obtains performances similar to what it would deliver if trained over the whole original data. Coreset discovery is an active and open line of research as it allows improving training speed for the algorithms and may help humans understand the results. Building on previous works, a novel approach is presented: candidate coresets are iteratively optimized, adding and removing samples. As there is an obvious trade-off between limiting training size and quality of the results, a multi-objective evolutionary algorithm is used to minimize simultaneously the number of points in the set and the classification error. Experimental results on non-trivial benchmarks show that the proposed approach is able to deliver results that allow a classifier to obtain lower error and better ability of generalizing on unseen data than state-of-the-art coreset discovery techniques.
 [63] arXiv:2002.08648 (cross-list from cs.LG) [pdf, other]

Title: Adaptive Graph Auto-Encoder for General Data Clustering
Comments: 11 pages containing one page supplementary
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Graph based clustering plays an important role in the clustering area. Recent studies about graph convolution neural networks have achieved impressive success on graph type data. However, in traditional clustering tasks, the graph structure of data does not exist, so the strategy used to construct the graph is crucial for performance. In addition, the existing graph auto-encoder based approaches perform poorly on weighted graphs, which are widely used in graph based clustering. In this paper, we propose a graph auto-encoder with local structure preserving for general data clustering, which can update the constructed graph adaptively. The adaptive process is designed to utilize the non-Euclidean structure sufficiently. By combining a generative model for graph embedding and graph based clustering, a graph auto-encoder with a novel decoder is developed and it performs well in scenarios that use weighted graphs. Extensive experiments prove the superiority of our model.
 [64] arXiv:2002.08665 (cross-list from cs.LG) [pdf, other]

Title: Computationally Tractable Riemannian Manifolds for Graph Embeddings
Comments: Submitted to International Conference on Machine Learning (ICML) 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Representing graphs as sets of node embeddings in certain curved Riemannian manifolds has recently gained momentum in machine learning due to their desirable geometric inductive biases, e.g., hierarchical structures benefit from hyperbolic geometry. However, going beyond embedding spaces of constant sectional curvature, while potentially more representationally powerful, proves to be challenging as one can easily lose the appeal of computationally tractable tools such as geodesic distances or Riemannian gradients. Here, we explore computationally efficient matrix manifolds, showcasing how to learn and optimize graph embeddings in these Riemannian spaces. Empirically, we demonstrate consistent improvements over Euclidean geometry while often outperforming hyperbolic and elliptical embeddings based on various metrics that capture different graph properties. Our results serve as new evidence for the benefits of non-Euclidean embeddings in machine learning pipelines.
 [65] arXiv:2002.08675 (cross-list from cs.LG) [pdf, other]

Title: Unsupervised Domain Adaptation via Discriminative Manifold Embedding and Alignment
Comments: Accepted to AAAI 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Unsupervised domain adaptation is effective in leveraging the rich information from the source domain to the unsupervised target domain. Though deep learning and adversarial strategy make an important breakthrough in the adaptability of features, there are two issues to be further explored. First, the hard-assigned pseudo labels on the target domain are risky to the intrinsic data structure. Second, the batch-wise training manner in deep learning limits the description of the global structure. In this paper, a Riemannian manifold learning framework is proposed to achieve transferability and discriminability consistently. As to the first problem, this method establishes a probabilistic discriminant criterion on the target domain via soft labels. Further, this criterion is extended to a global approximation scheme for the second issue; such approximation is also memory-saving. The manifold metric alignment is exploited to be compatible with the embedding space. A theoretical error bound is derived to facilitate the alignment. Extensive experiments have been conducted to investigate the proposal, and the results of the comparison study show the superiority of the consistent manifold learning framework.
 [66] arXiv:2002.08676 (cross-list from cs.LG) [pdf, other]

Title: Learning with Differentiable Perturbed Optimizers
Authors: Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, Francis Bach
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Machine learning pipelines often rely on optimization procedures to make discrete decisions (e.g. sorting, picking closest neighbors, finding shortest paths or optimal matchings). Although these discrete decisions are easily computed in a forward manner, they cannot be used to modify model parameters using first-order optimization techniques because they break the back-propagation of computational graphs. In order to expand the scope of learning problems that can be solved in an end-to-end fashion, we propose a systematic method to transform a block that outputs an optimal discrete decision into a differentiable operation. Our approach relies on stochastic perturbations of these parameters, and can be used readily within existing solvers without the need for ad hoc regularization or smoothing. These perturbed optimizers yield solutions that are differentiable and never locally constant. The amount of smoothness can be tuned via the chosen noise amplitude, whose impact we analyze. The derivatives of these perturbed solvers can be evaluated efficiently. We also show how this framework can be connected to a family of losses developed in structured prediction, and describe how these can be used in unsupervised and supervised learning, with theoretical guarantees. We demonstrate the performance of our approach on several machine learning tasks in experiments on synthetic and real data.
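To make the idea of smoothing a discrete decision by stochastic perturbation concrete, here is a minimal sketch for the simplest discrete solver, an argmax over scores. The Gaussian noise, the Monte Carlo averaging, and the names `perturbed_argmax`, `sigma` and `n_samples` are assumptions chosen for illustration and are not taken from the paper's code.

```python
import random


def perturbed_argmax(theta, sigma=0.5, n_samples=2000, seed=0):
    """Monte Carlo estimate of E[one_hot(argmax(theta + sigma * Z))], Z ~ N(0, I).

    A plain argmax returns a one-hot vector that is piecewise constant in
    theta; averaging over random perturbations gives a vector of
    probabilities whose expectation varies smoothly with theta.
    """
    rng = random.Random(seed)
    counts = [0] * len(theta)
    for _ in range(n_samples):
        noisy = [t + sigma * rng.gauss(0.0, 1.0) for t in theta]
        winner = max(range(len(theta)), key=lambda i: noisy[i])
        counts[winner] += 1
    return [c / n_samples for c in counts]


scores = [1.0, 2.0, 0.5]
smooth = perturbed_argmax(scores, sigma=0.5)    # soft decision spread over the arms
sharp = perturbed_argmax(scores, sigma=1e-9)    # negligible noise: the hard argmax
print(smooth, sharp)
```

The noise amplitude `sigma` plays the role of the smoothness knob described in the abstract: as it shrinks, the output approaches the hard one-hot decision, and as it grows the output spreads over more entries.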
 [67] arXiv:2002.08681 (cross-list from cs.LG) [pdf, other]

Title: Unsupervised Multi-Class Domain Adaptation: Theory, Algorithms, and Practice
Comments: The journal manuscript extended significantly from our preliminary CVPR conference paper. Codes are available at: this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
In this paper, we study the formalism of unsupervised multi-class domain adaptation (multi-class UDA), which underlies some recent algorithms whose learning objectives are only motivated empirically. A Multi-Class Scoring Disagreement (MCSD) divergence is presented by aggregating the absolute margin violations in multi-class classification; the proposed MCSD is able to fully characterize the relations between any pair of multi-class scoring hypotheses. By using MCSD as a measure of domain distance, we develop a new domain adaptation bound for multi-class UDA as well as its data-dependent, probably approximately correct bound, which naturally suggest adversarial learning objectives to align conditional feature distributions across the source and target domains. Consequently, an algorithmic framework of Multi-class Domain-adversarial learning Networks (McDalNets) is developed, whose different instantiations via surrogate learning objectives either coincide with or resemble a few recently popular methods, thus (partially) underscoring their practical effectiveness. Based on our same theory of multi-class UDA, we also introduce a new algorithm of Domain-Symmetric Networks (SymmNets), which is featured by a novel adversarial strategy of domain confusion and discrimination. SymmNets afford simple extensions that work equally well under the problem settings of either closed set, partial, or open set UDA. We conduct careful empirical studies to compare different algorithms of McDalNets and our newly introduced SymmNets. Experiments verify our theoretical analysis and show the efficacy of our proposed SymmNets. We make our implementation codes publicly available.
 [68] arXiv:2002.08695 (cross-list from cs.LG) [pdf, other]

Title: Stochastic Optimization for Regularized Wasserstein Estimators
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Optimal transport is a foundational problem in optimization that allows one to compare probability distributions while taking into account geometric aspects. Its optimal objective value, the Wasserstein distance, provides an important loss between distributions that has been used in many applications throughout machine learning and statistics. Recent algorithmic progress on this problem and its regularized versions has made these tools increasingly popular. However, existing techniques require solving an optimization problem to obtain a single gradient of the loss, thus slowing down first-order methods to minimize the sum of losses, which require many such gradient computations. In this work, we introduce an algorithm to solve a regularized version of this problem of Wasserstein estimators, with a time per step which is sublinear in the natural dimensions of the problem. We introduce a dual formulation, and optimize it with stochastic gradient steps that can be computed directly from samples, without solving additional optimization problems at each step. Doing so, the estimation and computation tasks are performed jointly. We show that this algorithm can be extended to other tasks, including estimation of Wasserstein barycenters. We provide theoretical guarantees and illustrate the performance of our algorithm with experiments on synthetic data.
 [69] arXiv:2002.08697 (cross-list from cs.LG) [pdf, other]

Title: Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs
Authors: Valentin Radu, Kuba Kaszyk, Yuan Wen, Jack Turner, Jose Cano, Elliot J. Crowley, Bjorn Franke, Amos Storkey, Michael O'Boyle
Comments: A copy of this was published in IISWC'19
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Convolutional Neural Networks (CNN) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, many times just by porting large models designed for server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations is channel pruning. Mobile and embedded systems now have GPUs which are ideal for the parallel computations of neural networks and for their lower energy cost per operation. Specialized libraries perform these neural network computations through highly optimized routines. As we find in our experiments, these libraries are optimized for the most common network shapes, making uninstructed channel pruning inefficient. We evaluate higher level libraries, which analyze the input characteristics of a convolutional layer, based on which they produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. However, in reality, these characteristics and subsequent choices intended for optimization can have the opposite effect. We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to 2x slowdown. On the other hand, we also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3x with cuDNN and above 10x with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning.
 [70] arXiv:2002.08709 (cross-list from cs.LG) [pdf, other]

Title: Do We Need Zero Training Loss After Achieving Zero Training Error?
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Overparameterized deep networks have the capacity to memorize training data with zero training error. Even after memorization, the training loss continues to approach zero, making the model overconfident and the test performance degraded. Since existing regularizers do not directly aim to avoid zero training loss, they often fail to maintain a moderate level of training loss, ending up with a too small or too large loss. We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value, which we call the flooding level. Our approach makes the loss float around the flooding level by doing mini-batched gradient descent as usual but gradient ascent if the training loss is below the flooding level. This can be implemented with one line of code, and is compatible with any stochastic optimizer and other regularizers. With flooding, the model will continue to "random walk" with the same non-zero training loss, and we expect it to drift into an area with a flat loss landscape that leads to better generalization. We experimentally show that flooding improves performance and as a by-product, induces a double descent curve of the test loss.
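The descent-above, ascent-below rule in the abstract above can be sketched on a toy problem. One natural way to express it, assumed here for illustration, is to replace the loss $J$ by $|J - b| + b$ for a flooding level $b$: its gradient equals that of $J$ above the level and is sign-flipped below it. The quadratic loss, the step size, and the names `flooded_loss` and `train` are illustrative assumptions, not the paper's experimental setup.

```python
def flooded_loss(loss, flood_level):
    """Return |loss - b| + b: unchanged above the flood level b, mirrored below it."""
    return abs(loss - flood_level) + flood_level


def train(w=2.0, flood_level=0.25, lr=0.1, steps=100):
    """Gradient descent on the flooded version of the toy loss J(w) = w**2."""
    for _ in range(steps):
        loss = w * w                 # toy training loss
        grad = 2.0 * w               # gradient of the toy loss
        # Gradient of |J - b| + b: same direction as J above b, reversed below b.
        direction = 1.0 if loss > flood_level else -1.0
        w -= lr * direction * grad
    return w, w * w


final_w, final_loss = train()
print(final_w, final_loss, flooded_loss(final_loss, 0.25))
```

Instead of being driven to zero, the toy loss ends up oscillating in a band around the flooding level, which is the "floating" behaviour the abstract describes.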
 [71] arXiv:2002.08717 (cross-list from math.OC) [pdf, ps, other]

Title: The Directional Optimal Transport
Comments: 30 pages, 5 figures
Subjects: Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
We introduce a constrained optimal transport problem where origins $x$ can only be transported to destinations $y\geq x$. Our statistical motivation is to describe the sharp upper bound for the variance of the treatment effect $Y-X$ given marginals when the effect is monotone, or $Y\geq X$. We thus focus on supermodular costs (or submodular rewards) and introduce a coupling $P_{*}$ that is optimal for all such costs and yields the sharp bound. This coupling admits manifold characterizations (geometric, order-theoretic, as optimal transport, through the cdf, and via the transport kernel) that explain its structure and imply useful bounds. When the first marginal is atomless, $P_{*}$ is concentrated on the graphs of two maps which can be described in terms of the marginals, the second map arising due to the binding constraint.
 [72] arXiv:2002.08740 (cross-list from cs.LG) [pdf, other]

Title: Towards Certifiable Adversarial Sample Detection
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Convolutional Neural Networks (CNNs) are deployed in more and more classification systems, but adversarial samples can be maliciously crafted to trick them, and are becoming a real threat. There have been various proposals to improve CNNs' adversarial robustness but these all suffer performance penalties or other limitations. In this paper, we provide a new approach in the form of a certifiable adversarial detection scheme, the Certifiable Taboo Trap (CTT). The system can provide certifiable guarantees of detection of adversarial inputs for certain $l_{\infty}$ sizes on a reasonable assumption, namely that the training data have the same distribution as the test data. We develop and evaluate several versions of CTT with a range of defense capabilities, training overheads and certifiability on adversarial samples. Against adversaries with various $l_p$ norms, CTT outperforms existing defense methods that focus purely on improving network robustness. We show that CTT has small false positive rates on clean test data, minimal compute overheads when deployed, and can support complex security policies.
 [73] arXiv:2002.08762 (cross-list from cs.LG) [pdf, other]

Title: Error detection in Knowledge Graphs: Path Ranking, Embeddings or both?
Comments: 19 pages, 3 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)
This paper attempts to compare and combine different approaches for detecting errors in Knowledge Graphs. Knowledge Graphs constitute a mainstream approach for the representation of relational information on big heterogeneous data; however, they may contain a large amount of imputed noise when constructed automatically. To address this problem, different error detection methodologies have been proposed, mainly focusing on path ranking and representation learning. This work presents various mainstream approaches and proposes a novel hybrid and modular methodology for the task. We compare these methods on two benchmarks and one real-world biomedical publications dataset, showcasing the potential of our approach and drawing insights regarding the state of the art in error detection in Knowledge Graphs.
 [74] arXiv:2002.08772 (cross-list from cs.LG) [pdf, other]

Title: Set2Graph: Learning Graphs From Sets
Authors: Hadar Serviansky, Nimrod Segol, Jonathan Shlomi, Kyle Cranmer, Eilam Gross, Haggai Maron, Yaron Lipman
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Many problems in machine learning (ML) can be cast as learning functions from sets to graphs, or more generally to hypergraphs; in short, Set2Graph functions. Examples include clustering, learning vertex and edge features on graphs, and learning triplet data in a collection. Current neural network models that approximate Set2Graph functions come from two main ML sub-fields: equivariant learning, and similarity learning. Equivariant models would be in general computationally challenging or even infeasible, while similarity learning models can be shown to have limited expressive power. In this paper we suggest a neural network model family for learning Set2Graph functions that is both practical and of maximal expressive power (universal), that is, can approximate arbitrary continuous Set2Graph functions over compact sets. Testing our models on different machine learning tasks, including an application to particle physics, we find them favorable to existing baselines.
 [75] arXiv:2002.08782 (cross-list from cs.LG) [pdf, other]

Title: Dynamic Federated Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Federated learning has emerged as an umbrella term for centralized coordination strategies in multi-agent environments. While many federated learning architectures process data in an online manner, and are hence adaptive by nature, most performance analyses assume static optimization problems and offer no guarantees in the presence of drifts in the problem solution or data characteristics. We consider a federated learning model where at every iteration, a random subset of available agents perform local updates based on their data. Under a non-stationary random walk model on the true minimizer for the aggregate optimization problem, we establish that the performance of the architecture is determined by three factors, namely, the data variability at each agent, the model variability across all agents, and a tracking term that is inversely proportional to the learning rate of the algorithm. The results clarify the tradeoff between convergence and tracking performance.
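The setting described in this abstract (a random subset of agents updating locally while the true minimizer drifts as a random walk) can be illustrated with a toy simulation. The sketch below is not the authors' algorithm or analysis; the scalar quadratic losses, the function name `simulate_dynamic_fedavg`, and every parameter value are assumptions chosen only for illustration.

```python
import random


def simulate_dynamic_fedavg(rounds=200, num_agents=20, sampled=5,
                            step_size=0.1, drift_std=0.01, noise_std=0.1,
                            seed=0):
    """Track a drifting scalar minimizer with partial agent participation."""
    rng = random.Random(seed)
    # Assumed setup: agent k prefers target + offsets[k] (model variability).
    raw = [rng.uniform(-0.5, 0.5) for _ in range(num_agents)]
    mean_raw = sum(raw) / num_agents
    # Center the offsets so the aggregate minimizer is exactly `target`.
    offsets = [r - mean_raw for r in raw]
    target = 5.0  # true minimizer of the aggregate problem, drifts over time
    w = 0.0       # global model held by the server
    errors = []
    for _ in range(rounds):
        target += rng.gauss(0.0, drift_std)              # random-walk drift
        active = rng.sample(range(num_agents), sampled)  # random agent subset
        local_models = []
        for k in active:
            # Noisy gradient of the local loss 0.5 * (w - (target + offsets[k]))**2;
            # the added Gaussian term stands in for data variability.
            grad = (w - (target + offsets[k])) + rng.gauss(0.0, noise_std)
            local_models.append(w - step_size * grad)
        w = sum(local_models) / len(local_models)        # server averages
        errors.append(abs(w - target))
    return errors


errors = simulate_dynamic_fedavg()
print(f"error after first round {errors[0]:.3f}, after last round {errors[-1]:.3f}")
```

In this toy model a larger `step_size` lets the global model follow the drift more closely but passes more gradient noise into it, which is one informal way to see the convergence versus tracking tradeoff mentioned in the abstract.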
 [76] arXiv:2002.08791 (cross-list from cs.LG) [pdf, other]

Title: Bayesian Deep Learning and a Probabilistic Perspective of Generalization
Comments: 27 pages, 17 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The key distinguishing property of a Bayesian approach is marginalization, rather than using a single setting of weights. Bayesian marginalization can particularly improve the accuracy and calibration of modern deep neural networks, which are typically underspecified by the data, and can represent many compelling but different solutions. We show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization, and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction, without significant overhead. We also investigate the prior over functions implied by a vague distribution over neural network weights, explaining the generalization properties of such models from a probabilistic perspective. From this perspective, we explain results that have been presented as mysterious and distinct to neural network generalization, such as the ability to fit images with random labels, and show that these results can be reproduced with Gaussian processes. Finally, we provide a Bayesian perspective on tempering for calibrating predictive distributions.
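As a rough illustration of marginalization versus a single weight setting, a deep ensemble approximates the Bayesian model average $p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw$ by the equally weighted mean $\frac{1}{M} \sum_m p(y \mid x, w_m)$ over $M$ trained members. The sketch below shows only that averaging step; the logits are made-up numbers, the function names are hypothetical, and the within-basin marginalization proposed in the paper is not implemented here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predictive(member_logits):
    """Average the class probabilities predicted by each ensemble member."""
    probs = [softmax(logits) for logits in member_logits]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


# Hypothetical logits from three independently trained networks on one input.
member_logits = [[2.0, 0.5, -1.0], [0.2, 1.5, -0.3], [1.0, 1.0, 0.0]]
avg = ensemble_predictive(member_logits)
print([round(p, 3) for p in avg])
```

Averaging probabilities rather than logits keeps the result a valid mixture of the members' predictive distributions.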
 [77] arXiv:2002.08799 (cross-list from cs.LG) [pdf, other]

Title: A Structured Prediction Approach for Conditional Meta-Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Optimization-based meta-learning algorithms are a powerful class of methods for learning-to-learn applications such as few-shot learning. They tackle the limited availability of training data by leveraging the experience gained from previously observed tasks. However, when the complexity of the tasks distribution cannot be captured by a single set of shared meta-parameters, existing methods may fail to fully adapt to a target task. We address this issue with a novel perspective on conditional meta-learning based on structured prediction. We propose task-adaptive structured meta-learning (TASML), a principled estimator that weighs meta-training data conditioned on the target task to design tailored meta-learning objectives. In addition, we introduce algorithmic improvements to tackle key computational limitations of existing methods. Experimentally, we show that TASML outperforms state-of-the-art methods on benchmark datasets both in terms of accuracy and efficiency. An ablation study quantifies the individual contribution of model components and suggests useful practices for meta-learning.
 [78] arXiv:2002.08803 (cross-list from cs.LG) [pdf, other]

Title: Support-weighted Adversarial Imitation Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Adversarial Imitation Learning (AIL) is a broad family of imitation learning methods designed to mimic expert behaviors from demonstrations. While AIL has shown state-of-the-art performance on imitation learning with only a small number of demonstrations, it faces several practical challenges such as potential training instability and implicit reward bias. To address these challenges, we propose Support-weighted Adversarial Imitation Learning (SAIL), a general framework that extends a given AIL algorithm with information derived from support estimation of the expert policies. SAIL improves the quality of the reinforcement signals by weighing the adversarial reward with a confidence score from support estimation of the expert policy. We also show that SAIL is always at least as efficient as the underlying AIL algorithm that SAIL uses for learning the adversarial reward. Empirically, we show that the proposed method achieves better performance and training stability than baseline methods on a wide range of benchmark control tasks.
 [79] arXiv:2002.08831 (cross-list from math.NA) [pdf, ps, other]

Title: Efficiently updating a covariance matrix and its LDL decomposition
Subjects: Numerical Analysis (math.NA); Computation (stat.CO)
Equations are presented which efficiently update or downdate the covariance matrix of a large number of $m$-dimensional observations. Updates and downdates to the covariance matrix, as well as mixed updates/downdates, are shown to be rank-$k$ modifications, where $k$ is the number of new observations added plus the number of old observations removed. As a result, the update and downdate equations decrease the required number of multiplications for a modification to $\Theta((k+1)m^2)$ instead of $\Theta((n+k+1)m^2)$ or $\Theta((n-k+1)m^2)$, where $n$ is the number of initial observations. Having the rank-$k$ formulas for the updates also allows a number of other known identities to be applied, providing a way of applying updates and downdates directly to the inverse and decompositions of the covariance matrix. To illustrate, we provide an efficient algorithm for applying the rank-$k$ update to the LDL decomposition of a covariance matrix.
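To make the idea of a low-rank covariance update concrete, the sketch below adds $k$ new observations to an existing mean and scatter matrix without revisiting the $n$ old observations. It uses the standard pairwise merge identity $S' = S + S_{\text{new}} + \frac{nk}{n+k}(\nu - \mu)(\nu - \mu)^T$, which is an assumption on my part and not necessarily the form of the equations in the paper; the function names are hypothetical, and the LDL update is not shown.

```python
def mean_and_scatter(rows):
    """Return (count, mean vector, scatter matrix) computed from scratch."""
    n, m = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(m)]
    scatter = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows)
                for b in range(m)] for a in range(m)]
    return n, mean, scatter


def update_scatter(n, mean, scatter, new_rows):
    """Merge k new observations into existing statistics in O(k * m^2) work."""
    k, new_mean, new_scatter = mean_and_scatter(new_rows)
    total = n + k
    delta = [nm - om for nm, om in zip(new_mean, mean)]
    weight = n * k / total
    m = len(mean)
    merged = [[scatter[a][b] + new_scatter[a][b] + weight * delta[a] * delta[b]
               for b in range(m)] for a in range(m)]
    merged_mean = [om + (k / total) * d for om, d in zip(mean, delta)]
    return total, merged_mean, merged


old = [[1.0, 2.0], [2.0, 1.0], [4.0, 3.0]]
new = [[0.0, 5.0], [3.0, 3.0]]
n, mu, S = mean_and_scatter(old)
n2, mu2, S2 = update_scatter(n, mu, S, new)
# Sample covariance is the scatter matrix divided by (count - 1).
cov = [[v / (n2 - 1) for v in row] for row in S2]
print(cov)
```

The correction term is a sum of at most $k$ rank-one matrices, which is consistent with the rank-$k$ structure the abstract describes.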
 [80] arXiv:2002.08837 (cross-list from cs.LG) [pdf, other]

Title: No-Regret and Incentive-Compatible Online Learning
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
We study online learning settings in which experts act strategically to maximize their influence on the learning algorithm's predictions by potentially misreporting their beliefs about a sequence of binary events. Our goal is twofold. First, we want the learning algorithm to be no-regret with respect to the best fixed expert in hindsight. Second, we want incentive compatibility, a guarantee that each expert's best strategy is to report his true beliefs about the realization of each event. To achieve this goal, we build on the literature on wagering mechanisms, a type of multi-agent scoring rule. We provide algorithms that achieve no regret and incentive compatibility for myopic experts for both the full and partial information settings. In experiments on datasets from FiveThirtyEight, our algorithms have regret comparable to classic no-regret algorithms, which are not incentive-compatible. Finally, we identify an incentive-compatible algorithm for forward-looking strategic agents that exhibits diminishing regret in practice.
 [81] arXiv:2002.08838 (cross-list from cs.LG) [pdf, other]

Title: On the Decision Boundaries of Deep Neural Networks: A Tropical Geometry Perspective
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This work tackles the problem of characterizing and understanding the decision boundaries of neural networks with piecewise linear nonlinearity activations. We use tropical geometry, a new development in the area of algebraic geometry, to characterize the decision boundaries of a simple neural network of the form (Affine, ReLU, Affine). Our main finding is that the decision boundaries are a subset of a tropical hypersurface, which is intimately related to a polytope formed by the convex hull of two zonotopes. The generators of these zonotopes are functions of the neural network parameters. This geometric characterization provides new perspective to three tasks. Specifically, we propose a new tropical perspective to the lottery ticket hypothesis, where we see the effect of different initializations on the tropical geometric representation of a network's decision boundaries. Moreover, we use this characterization to propose a new set of tropical regularizers, which directly deal with the decision boundaries of a network. We investigate the use of these regularizers in neural network pruning (by removing network parameters that do not contribute to the tropical geometric representation of the decision boundaries) and in generating adversarial input attacks (by producing input perturbations that explicitly perturb the decision boundaries' geometry and ultimately change the network's prediction).
 [82] arXiv:2002.08849 (cross-list from q-fin.ST) [pdf, other]

Title: Forecasting Realized Volatility Matrix With Copula-Based Models
Comments: 26 pages, 3 figures
Subjects: Statistical Finance (q-fin.ST); Applications (stat.AP)
Multivariate volatility modeling and forecasting are crucial in financial economics. This paper develops a copula-based approach to model and forecast realized volatility matrices. The proposed copula-based time series models can capture the hidden dependence structure of realized volatility matrices. Also, this approach can automatically guarantee the positive definiteness of the forecasts through either Cholesky decomposition or matrix logarithm transformation. In this paper we consider both multivariate and bivariate copulas; the types of copulas include Student's t, Clayton and Gumbel copulas. In an empirical application, we find that for one-day ahead volatility matrix forecasting, these copula-based models can achieve significant performance both in terms of statistical precision and in creating economically mean-variance efficient portfolios. Among the copulas we considered, the multivariate-t copula performs better in statistical precision, while the bivariate-t copula has better economic performance.
 [83] arXiv:2002.08856 (cross-list from math.OC) [pdf, ps, other]

Title: Bounding the expected run-time of nonconvex optimization with early stopping
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
This work examines the convergence of stochastic gradient-based optimization algorithms that use early stopping based on a validation function. The form of early stopping we consider is that optimization terminates when the norm of the gradient of a validation function falls below a threshold. We derive conditions that guarantee this stopping rule is well-defined, and provide bounds on the expected number of iterations and gradient evaluations needed to meet this criterion. The guarantee accounts for the distance between the training and validation sets, measured with the Wasserstein distance. We develop the approach in the general setting of a first-order optimization algorithm, with possibly biased update directions subject to a geometric drift condition. We then derive bounds on the expected running time for early stopping variants of several algorithms, including stochastic gradient descent (SGD), decentralized SGD (DSGD), and the stochastic variance reduced gradient (SVRG) algorithm. Finally, we consider the generalization properties of the iterate returned by early stopping.
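The stopping rule described in this abstract (terminate once the gradient norm of a validation function drops below a threshold) can be sketched on a one-parameter least-squares problem. Everything below is an illustrative assumption: the model $y = wx$, the noise-free synthetic data, the function name, and the parameter values. The data are noise-free so that the threshold is certain to be reached; when the training and validation sets disagree, the rule may never fire, which is why the sketch also carries an iteration cap.

```python
import random


def sgd_with_validation_stopping(train, valid, step_size=0.1, threshold=1e-3,
                                 max_iters=10_000, seed=0):
    """Run SGD on `train`; stop when the validation gradient norm is small."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(max_iters):
        # Full gradient of the validation loss 0.5 * mean((w * x - y) ** 2).
        val_grad = sum((w * x - y) * x for x, y in valid) / len(valid)
        if abs(val_grad) < threshold:
            return w, t, True              # stopping criterion met
        x, y = rng.choice(train)           # one stochastic training sample
        w -= step_size * (w * x - y) * x   # SGD step on the training loss
    return w, max_iters, False             # cap reached without stopping


data_rng = random.Random(1)


def make_set(n):
    """Noise-free samples of y = 2x with x drawn from [0.5, 1.5]."""
    xs = [data_rng.uniform(0.5, 1.5) for _ in range(n)]
    return [(x, 2.0 * x) for x in xs]


train, valid = make_set(50), make_set(20)
w, iters, stopped = sgd_with_validation_stopping(train, valid)
print(f"w = {w:.4f}, iterations = {iters}, stopped early = {stopped}")
```

Each check here costs one full pass over the validation set, so in practice one might evaluate the criterion only every few iterations.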
 [84] arXiv:2002.08859 (cross-list from cs.LG) [pdf, other]

Title: A Bayes-Optimal View on Adversarial Examples
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
The ability to fool modern CNN classifiers with tiny perturbations of the input has led to the development of a large number of candidate defenses and often conflicting explanations. In this paper, we argue for examining adversarial examples from the perspective of Bayes-Optimal classification. We construct realistic image datasets for which the Bayes-Optimal classifier can be efficiently computed and derive analytic conditions on the distributions so that the optimal classifier is either robust or vulnerable. By training different classifiers on these datasets (for which the "gold standard" optimal classifiers are known), we can disentangle the possible sources of vulnerability and avoid the accuracy-robustness tradeoff that may occur in commonly used datasets. Our results show that even when the optimal classifier is robust, standard CNN training consistently learns a vulnerable classifier. At the same time, for exactly the same training data, RBF SVMs consistently learn a robust classifier. The same trend is observed in experiments with real images.
 [85] arXiv:2002.08860 (cross-list from cs.LG) [pdf, other]

Title: Dissipative SymODEN: Encoding Hamiltonian Dynamics with Dissipation and Control into Deep Learning
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
In this work, we introduce Dissipative SymODEN, a deep learning architecture which can infer the dynamics of a physical system with dissipation from observed state trajectories. To improve prediction accuracy while reducing network size, Dissipative SymODEN encodes the port-Hamiltonian dynamics with energy dissipation and external input into the design of its computation graph and learns the dynamics in a structured way. The learned model, by revealing key aspects of the system, such as the inertia, dissipation, and potential energy, paves the way for energy-based controllers.
 [86] arXiv:2002.08898 (cross-list from cs.CL) [pdf, other]

Title: MA-DST: Multi-Attention Based Scalable Dialog State Tracking
Comments: Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Task oriented dialog agents provide a natural language interface for users to complete their goal. Dialog State Tracking (DST), which is often a core component of these systems, tracks the system's understanding of the user's goal throughout the conversation. To enable accurate multi-domain DST, the model needs to encode dependencies between past utterances and slot semantics and understand the dialog context, including long-range cross-domain references. We introduce a novel architecture for this task to encode the conversation history and slot semantics more robustly by using attention mechanisms at multiple granularities. In particular, we use cross-attention to model relationships between the context and slots at different semantic levels and self-attention to resolve cross-domain coreferences. In addition, our proposed architecture does not rely on knowing the domain ontologies beforehand and can also be used in a zero-shot setting for new domains or unseen slot values. Our model improves the joint goal accuracy by 5% (absolute) in the full-data setting and by up to 2% (absolute) in the zero-shot setting over the present state-of-the-art on the MultiWoZ 2.1 dataset.
 [87] arXiv:2002.08902 (cross-list from cs.CL) [pdf, other]

Title: Application of Pretraining Models in Named Entity Recognition
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task to extract entities from unstructured data. The previous methods for NER were based on machine learning or deep learning. Recently, pretraining models have significantly improved performance on multiple NLP tasks. In this paper, firstly, we introduce the architecture and pretraining tasks of four common pretraining models: BERT, ERNIE, ERNIE2.0-tiny, and RoBERTa. Then, we apply these pretraining models to a NER task by fine-tuning, and compare the effects of the different model architectures and pretraining tasks on the NER task. The experiment results showed that RoBERTa achieved state-of-the-art results on the MSRA-2006 dataset.
 [88] arXiv:2002.08907 (cross-list from math.OC) [pdf, other]

Title: Second-order Conditional Gradients
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Constrained second-order convex optimization algorithms are the method of choice when a high accuracy solution to a problem is needed, due to the quadratic convergence rates these methods enjoy when close to the optimum. These algorithms require the solution of a constrained quadratic subproblem at every iteration. In the case where the feasible region can only be accessed efficiently through a linear optimization oracle, and computing first-order information about the function, although possible, is costly, the coupling of constrained second-order and conditional gradient algorithms leads to competitive algorithms with solid theoretical guarantees and good numerical performance.
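For readers unfamiliar with conditional gradient methods, the sketch below shows how a feasible region is touched only through a linear optimization oracle. It is the plain first-order Frank-Wolfe iteration with the classic step size $2/(t+2)$, not the second-order variant this paper proposes; the objective (squared distance to a point), the simplex feasible region, and the function name are assumptions picked for illustration.

```python
def frank_wolfe_simplex(target, iters=2000):
    """Minimize ||x - target||^2 over the probability simplex."""
    dim = len(target)
    x = [1.0] + [0.0] * (dim - 1)  # start at a vertex of the simplex
    for t in range(iters):
        grad = [2.0 * (xi - bi) for xi, bi in zip(x, target)]
        # Linear optimization oracle: the simplex vertex e_i that minimizes
        # <grad, v> is the one at the smallest gradient coordinate.
        best = min(range(dim), key=lambda i: grad[i])
        gamma = 2.0 / (t + 2.0)
        # Convex combination of x and the vertex; no projection is needed.
        x = [(1.0 - gamma) * xi for xi in x]
        x[best] += gamma
    return x


x = frank_wolfe_simplex([0.2, 0.3, 0.5])
print([round(v, 3) for v in x])
```

Because every iterate is a convex combination of vertices, feasibility is maintained without ever solving a projection or a constrained quadratic subproblem.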
 [89] arXiv:2002.08910 (cross-list from cs.CL) [pdf, other]

Title: How Much Knowledge Can You Pack Into the Parameters of a Language Model?
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the practical utility of this approach by fine-tuning pretrained models to answer questions without access to any external context or knowledge. We show that this approach scales surprisingly well with model size and outperforms models that explicitly look up knowledge on the open-domain variants of Natural Questions and WebQuestions.
 [90] arXiv:2002.08927 (cross-list from cs.LG) [pdf, other]

Title: Regularized Autoencoders via Relaxed Injective Probability Flow
Comments: AISTATS 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Invertible flow-based generative models are an effective method for learning to generate samples, while allowing for tractable likelihood computation and inference. However, the invertibility requirement restricts models to have the same latent dimensionality as the inputs. This imposes significant architectural, memory, and computational costs, making them more challenging to scale than other classes of generative models such as Variational Autoencoders (VAEs). We propose a generative model based on probability flows that does away with the bijectivity requirement on the model and only assumes injectivity. This also provides another perspective on regularized autoencoders (RAEs), with our final objectives resembling RAEs with specific regularizers that are derived by lower bounding the probability flow objective. We empirically demonstrate the promise of the proposed model, improving over VAEs and AEs in terms of sample quality.
 [91] arXiv:2002.08930 (cross-list from cs.LG) [pdf, other]

Title: Multi-step Online Unsupervised Domain Adaptation
Comments: To appear in ICASSP 2020. Copyright 2020 IEEE
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper, we address the Online Unsupervised Domain Adaptation (OUDA) problem, where the target data are unlabelled and arriving sequentially. The traditional methods on the OUDA problem mainly focus on transforming each arriving target data to the source domain, and they do not sufficiently consider the temporal coherency and accumulative statistics among the arriving target data. We propose a multi-step framework for the OUDA problem, which institutes a novel method to compute the mean-target subspace inspired by the geometrical interpretation on the Euclidean space. This mean-target subspace contains accumulative temporal information among the arrived target data. Moreover, the transformation matrix computed from the mean-target subspace is applied to the next target data as a preprocessing step, aligning the target data closer to the source domain. Experiments on four datasets demonstrated the contribution of each step in our proposed multi-step OUDA framework and its performance over previous approaches.
 [92] arXiv:2002.08933 (cross-list from eess.AS) [pdf, other]

Title: Wavesplit: End-to-End Speech Separation by Speaker Clustering
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Machine Learning (stat.ML)
We introduce Wavesplit, an end-to-end speech separation system. From a single recording of mixed speech, the model infers and clusters representations of each speaker and then estimates each source signal conditioned on the inferred representations. The model is trained on the raw waveform to jointly perform the two tasks. Our model infers a set of speaker representations through clustering, which addresses the fundamental permutation problem of speech separation. Moreover, the sequence-wide speaker representations provide a more robust separation of long, challenging sequences, compared to previous approaches. We show that Wavesplit outperforms the previous state-of-the-art on clean mixtures of 2 or 3 speakers (WSJ0-2mix, WSJ0-3mix), as well as in noisy (WHAM!) and reverberated (WHAMR!) conditions. As an additional contribution, we further improve our model by introducing online data augmentation for separation.
 [93] arXiv:2002.08934 (cross-list from cs.LG) [pdf, ps, other]

Title: Online high rank matrix completion
Comments: The paper was published in the proceedings of IEEE CVPR 2019
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Recent advances in matrix completion enable data imputation in full-rank matrices by exploiting low dimensional (nonlinear) latent structure. In this paper, we develop a new model for high rank matrix completion (HRMC), together with batch and online methods to fit the model and out-of-sample extension to complete new data. The method works by (implicitly) mapping the data into a high dimensional polynomial feature space using the kernel trick; importantly, the data occupies a low dimensional subspace in this feature space, even when the original data matrix is of full rank. We introduce an explicit parametrization of this low dimensional subspace, and an online fitting procedure, to reduce computational complexity compared to the state of the art. The online method can also handle streaming or sequential data and adapt to non-stationary latent structure. We provide guidance on the sampling rate required for these methods to succeed. Experimental results on synthetic data and motion capture data validate the performance of the proposed methods.
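The "implicit mapping" mentioned in this abstract rests on a standard identity: a polynomial kernel evaluated on the raw inputs equals an ordinary inner product between explicit polynomial features. The sketch below checks this for the degree-2 kernel $(x \cdot y + c)^2$ in two dimensions. It illustrates the kernel trick only and is not the HRMC model; the function names and test points are assumptions.

```python
import math


def poly_kernel(x, y, c=1.0):
    """Degree-2 polynomial kernel (x . y + c)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1] + c) ** 2


def explicit_features(x, c=1.0):
    """Explicit 6-D feature map whose inner product reproduces poly_kernel."""
    x1, x2 = x
    return [x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2,
            math.sqrt(2.0 * c) * x1, math.sqrt(2.0 * c) * x2, c]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


a, b = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(a, b)
explicit = dot(explicit_features(a), explicit_features(b))
print(implicit, explicit)
```

The kernel needs only a 2-D inner product, while the explicit map already has 6 coordinates at degree 2; the gap grows quickly with input dimension and degree, which is why the implicit form is preferred.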
 [94] arXiv:2002.08936 (cross-list from cs.LG) [pdf, other]

Title: Meta-learning for mixed linear regression
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In modern supervised learning, there are a large number of tasks, but many of them are associated with only a small amount of labeled data. These include data from medical image processing and robotic interaction. Even though each individual task cannot be meaningfully trained in isolation, one seeks to meta-learn across the tasks from past experiences by exploiting some similarities. We study a fundamental question of interest: When can abundant tasks with small data compensate for lack of tasks with big data? We focus on a canonical scenario where each task is drawn from a mixture of $k$ linear regressions, and identify sufficient conditions for such a graceful exchange to hold: the total number of examples necessary with only small data tasks scales similarly as when big data tasks are available. To this end, we introduce a novel spectral approach and show that we can efficiently utilize small data tasks with the help of $\tilde\Omega(k^{3/2})$ medium data tasks each with $\tilde\Omega(k^{1/2})$ examples.
 [95] arXiv:2002.08937 (cross-list from cs.LG) [pdf, other]

Title: Nyström Subspace Learning for Large-scale SVMs
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
As an implementation of the Nyström method, Nyström computational regularization (NCR) imposed on kernel classification and kernel ridge regression has proven capable of achieving optimal bounds in the large-scale statistical learning setting, while enjoying much better time complexity. In this study, we propose a Nyström subspace learning (NSL) framework to reveal that all you need for employing the Nyström method, including NCR, upon any kernel SVM is to use the efficient off-the-shelf linear SVM solvers as a black box. Based on our analysis, the bounds developed for the Nyström method are linked to NSL, and the analytical difference between two distinct implementations of the Nyström method is clearly presented. Besides, NSL also leads to sharper theoretical results for the clustered Nyström method. Finally, both regression and classification tasks are performed to compare two implementations of the Nyström method.
 [96] arXiv:2002.08949 (cross-list from cs.LG) [pdf, other]

Title: Improving Sampling Accuracy of Stochastic Gradient MCMC Methods via Non-uniform Subsampling of Gradients
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Common Stochastic Gradient MCMC methods approximate gradients by stochastic ones via uniformly subsampled data points. We propose that a non-uniform subsampling can reduce the variance introduced by the stochastic approximation, hence making the sampling of a target distribution more accurate. An exponentially weighted stochastic gradient approach (EWSG) is developed for this objective by matching the transition kernels of SGMCMC methods respectively based on stochastic and batch gradients. A demonstration of EWSG combined with second-order Langevin equation for sampling purposes is provided. In our method, non-uniform subsampling is done efficiently via a Metropolis-Hastings chain on the data index, which is coupled to the sampling algorithm. The fact that our method has reduced local variance with high probability is theoretically analyzed. A non-asymptotic global error analysis is also presented. Numerical experiments based on both synthetic and real world data sets are also provided to demonstrate the efficacy of the proposed approaches. While statistical accuracy has improved, the speed of convergence was empirically observed to be at least comparable to the uniform version.
Replacements for Fri, 21 Feb 20
 [97] arXiv:1406.5958 (replaced) [pdf, other]

Title: Prior sample size extensions for assessing prior informativeness and prior-likelihood discordance
Subjects: Methodology (stat.ME)
 [98] arXiv:1709.05545 (replaced) [pdf, other]

Title: Generating Compact Tree Ensembles via Annealing
Comments: Comparison with Random Forest included in the results section
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [99] arXiv:1802.02212 (replaced) [pdf, other]

Title: Classification and Disease Localization in Histopathology Using Only Global Labels: A Weakly-Supervised Approach
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [100] arXiv:1804.05464 (replaced) [pdf, other]

Title: On Gradient-Based Learning in Continuous Games
Journal-ref: SIAM Journal on Mathematics of Data Science 2020 2:1, 103-131
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [101] arXiv:1807.10801 (replaced) [pdf, other]

Title: On the expected runtime of multiple testing algorithms with bounded error
Authors: Georg Hahn
Subjects: Statistics Theory (math.ST)
 [102] arXiv:1809.02963 (replaced) [pdf, ps, other]

Title: Variational Approximation Error in Bayesian Nonnegative Matrix Factorization
Authors: Naoki Hayashi
Comments: 21 pages. 1 table. Revision in Neural Networks
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [103] arXiv:1809.06092 (replaced) [pdf, other]

Title: Testing relevant hypotheses in functional time series via self-normalization
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
 [104] arXiv:1811.00353 (replaced) [pdf, ps, other]

Title: Hanson-Wright inequality in Banach spaces
Comments: MSC classification and acknowledgement added, minor typo corrected, references updated
Subjects: Probability (math.PR); Functional Analysis (math.FA); Statistics Theory (math.ST)
 [105] arXiv:1812.06575 (replaced) [pdf, other]

Title: Matching on Generalized Propensity Scores with Continuous Exposures
Authors: Xiao Wu, Fabrizia Mealli, Marianthi-Anna Kioumourtzoglou, Francesca Dominici, Danielle Braun
Comments: We create an R package, GPSmacthing, available at this https URL, to implement the proposed matching approach
Subjects: Methodology (stat.ME); Applications (stat.AP)
 [106] arXiv:1812.07944 (replaced) [pdf, ps, other]

Title: Estimation and Inference in the Presence of Fractional d=1/2 and Weakly Nonstationary Processes
Subjects: Statistics Theory (math.ST)
 [107] arXiv:1901.05947 (replaced) [pdf, other]

Title: Stochastic Gradient Descent on a Tree: an Adaptive and Robust Approach to Stochastic Convex Optimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
 [108] arXiv:1901.08560 (replaced) [pdf, other]

Title: Semi-Unsupervised Learning: Clustering and Classifying using Ultra-Sparse Labels
Comments: 8 pages, plus appendix
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [109] arXiv:1902.03453 (replaced) [pdf]

Title: Distance metric learning based on structural neighborhoods for dimensionality reduction and classification performance improvement
Comments: 30 pages, 5 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [110] arXiv:1902.11038 (replaced) [pdf, other]

Title: Multi-Stage Self-Supervised Learning for Graph Convolutional Networks on Graphs with Few Labels
Comments: AAAI Conference on Artificial Intelligence (AAAI 2020)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [111] arXiv:1902.11045 (replaced) [pdf, other]

Title: Virtual Adversarial Training on Graph Convolutional Networks in Node Classification
Comments: Chinese Conference on Pattern Recognition and Computer Vision (PRCV) 2019 Oral paper
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [112] arXiv:1903.00374 (replaced) [pdf, other]

Title: Model-Based Reinforcement Learning for Atari
Authors: Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, Henryk Michalewski
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [113] arXiv:1903.05315 (replaced) [pdf, ps, other]

Title: Optimality of Maximum Likelihood for Log-Concave Density Estimation and Bounded Convex Regression
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG)
 [114] arXiv:1903.09231 (replaced) [pdf, ps, other]

Title: Recovering the Lowest Layer of Deep Networks with High Threshold Activations
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [115] arXiv:1903.09321 (replaced) [pdf, other]

Title: WONDER: Weighted one-shot distributed ridge regression in high dimensions
Comments: Gave the name "Wonder" to the algorithm, updated title, added algorithm for general nonisotropic design
Subjects: Statistics Theory (math.ST); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Computation (stat.CO)
 [116] arXiv:1903.10646 (replaced) [pdf, other]

Title: Increasing Iterate Averaging for Solving Saddle-Point Problems
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [117] arXiv:1904.06145 (replaced) [pdf, other]

Title: Towards Photographic Image Manipulation with Balanced Growing of Generative Autoencoders
Comments: WACV 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [118] arXiv:1905.10626 (replaced) [pdf, other]

Title: Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness
Comments: ICLR 2020
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
 [119] arXiv:1905.11379 (replaced) [pdf, ps, other]

Title: A New Non-Linear Conjugate Gradient Algorithm for Destructive Cure Rate Model and a Simulation Study: Illustration with Negative Binomial Competing Risks
Comments: arXiv admin note: text overlap with arXiv:1905.05963
Subjects: Statistics Theory (math.ST); Optimization and Control (math.OC)
 [120] arXiv:1905.12121 (replaced) [pdf, other]

Title: An Investigation of Data Poisoning Defenses for Online Learning
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
 [121] arXiv:1905.13002 (replaced) [pdf, other]

Title: Temporal Parallelization of Bayesian Smoothers
Subjects: Computation (stat.CO); Distributed, Parallel, and Cluster Computing (cs.DC); Dynamical Systems (math.DS)
 [122] arXiv:1906.02425 (replaced) [pdf, other]

Title: Uncertainty-guided Continual Learning with Bayesian Neural Networks
Comments: Accepted at ICLR 2020
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [123] arXiv:1906.02922 (replaced) [pdf, other]

Title: Parameter-Free Learning for Evolving Markov Decision Processes: The Blessing of (More) Optimism
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [124] arXiv:1906.04659 (replaced) [pdf, other]

Title: Stable Rank Normalization for Improved Generalization in Neural Networks and GANs
Comments: Accepted at the International Conference in Learning Representations, 2020, Addis Ababa, Ethiopia
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [125] arXiv:1906.05467 (replaced) [pdf, other]

Title: Interpretable Generative Neural Spatio-Temporal Point Processes
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
 [126] arXiv:1907.04155 (replaced) [pdf, other]

Title: GP-VAE: Deep Probabilistic Time Series Imputation
Comments: Accepted for publication at the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [127] arXiv:1909.10670 (replaced) [pdf, other]

Title: Subsampling Generative Adversarial Networks: Density Ratio Estimation in Feature Space with Softplus Loss
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [128] arXiv:1909.11515 (replaced) [pdf, other]

Title: Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks
Comments: ICLR 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [129] arXiv:1909.12077 (replaced) [pdf, other]

Title: Symplectic ODE-Net: Learning Hamiltonian Dynamics with Control
Journal-ref: International Conference on Learning Representations (ICLR 2020); https://openreview.net/forum?id=ryxmb1rKDS
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
 [130] arXiv:1909.13788 (replaced) [pdf, other]

Title: Revisiting Self-Training for Neural Sequence Generation
Comments: ICLR 2020. The first two authors contributed equally
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
 [131] arXiv:1909.13833 (replaced) [pdf, other]

Title: Relaxing Bijectivity Constraints with Continuously Indexed Normalising Flows
Comments: This is a major revision of our previous paper "Localised Generative Flows". We have significantly extended our theoretical justification, and have obtained experimental results on a wider range of baselines
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [132] arXiv:1910.00643 (replaced) [pdf, other]

Title: SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum
Comments: Accepted to ICLR 2020
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [133] arXiv:1910.03175 (replaced) [pdf, other]

Title: MIM: Mutual Information Machine
Comments: Preprint. Project webpage: this https URL
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
 [134] arXiv:1910.03344 (replaced) [pdf, ps, other]

Title: The Universal Approximation Property: Characterizations, Existence, and a Canonical Topology for Deep-Learning
Authors: Anastasis Kratsios
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS)
 [135] arXiv:1910.03561 (replaced) [pdf, other]

Title: Deep Network Classification by Scattering and Homotopy Dictionary Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [136] arXiv:1910.04938 (replaced) [pdf, other]

Title: Regret Analysis of Causal Bandit Problems
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [137] arXiv:1910.05270 (replaced) [pdf, ps, other]

Title: Fast and Bayes-consistent nearest neighbors
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [138] arXiv:1910.05725 (replaced) [pdf, other]

Title: If dropout limits trainable depth, does critical initialisation still matter? A large-scale statistical analysis on ReLU networks
Authors: Arnu Pretorius, Elan van Biljon, Benjamin van Niekerk, Ryan Eloff, Matthew Reynard, Steve James, Benjamin Rosman, Herman Kamper, Steve Kroon
Comments: 8 pages, 6 figures, under consideration at Pattern Recognition Letters
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [139] arXiv:1910.05769 (replaced) [pdf, ps, other]

Title: Large Deviation Analysis of Function Sensitivity in Random Deep Neural Networks
Journal-ref: J. Phys. A: Math. Theor. 53. 104002 (2020)
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [140] arXiv:1910.08371 (replaced) [pdf, other]

Title: Graph Convolutional Policy for Solving Tree Decomposition via Reinforcement Learning Heuristics
Comments: 8 pages, 7 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [141] arXiv:1910.11831 (replaced) [pdf, other]

Title: Stabilizing DARTS with Amended Gradient Estimation on Architectural Parameters
Comments: 21 pages, 11 figures, submitted to ICML 2020, extensive results are added
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [142] arXiv:1911.03432 (replaced) [pdf, other]

Title: Penalty Method for Inversion-Free Deep Bilevel Optimization
Comments: 17 Pages, 7 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [143] arXiv:1911.10633 (replaced) [pdf, other]

Title: The harmonic mean $\chi^2$ test to substantiate scientific findings
Authors: Leonhard Held
Comments: Revised version
Subjects: Methodology (stat.ME)
 [144] arXiv:1912.01599 (replaced) [pdf, ps, other]

Title: Stationary Points of Shallow Neural Networks with Quadratic Activation Function
Comments: 30 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
 [145] arXiv:1912.02290 (replaced) [pdf, other]

Title: Hierarchical Indian Buffet Neural Networks for Bayesian Continual Learning
Comments: Full preprint
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [146] arXiv:1912.03703 (replaced) [pdf, other]

Title: $\mathtt{MedGraph:}$ Structural and Temporal Representation Learning of Electronic Medical Records
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [147] arXiv:1912.04695 (replaced) [pdf, other]

Title: Transparent Classification with Multilayer Logical Perceptrons and Random Binarization
Comments: AAAI-20 (oral presentation); source codes added
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [148] arXiv:1912.05541 (replaced) [pdf, other]

Title: Fundamental Entropic Laws and $\mathcal{L}_p$ Limitations of Feedback Systems: Implications for Machine-Learning-in-the-Loop Control
Comments: arXiv admin note: text overlap with arXiv:1912.02628
Subjects: Systems and Control (eess.SY); Information Theory (cs.IT); Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
 [149] arXiv:1912.05695 (replaced) [pdf, other]

Title: Randomized Exploration for Non-Stationary Stochastic Linear Bandits
Comments: The current version is bug-free after correction
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [150] arXiv:1912.08335 (replaced) [pdf, other]

Title: Learning under Model Misspecification: Applications to Variational and Ensemble methods
Authors: Andres R. Masegosa
Comments: Typos corrected. Section 3 partially revised. New section at the appendix
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
 [151] arXiv:2001.01385 (replaced) [pdf, other]

Title: Identifying and Compensating for Feature Deviation in Imbalanced Deep Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [152] arXiv:2001.05494 (replaced) [pdf, other]

Title: Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders
Comments: Accepted for publication at the 24th European Conference on Artificial Intelligence (ECAI2020)
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [153] arXiv:2001.07524 (replaced) [pdf, other]

Title: Node Masking: Making Graph Neural Networks Generalize and Scale Better
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [154] arXiv:2002.01910 (replaced) [pdf, other]

Title: FastGAE: Fast, Scalable and Effective Graph Autoencoders with Stochastic Subgraph Decoding
Authors: Guillaume Salha, Romain Hennequin, Jean-Baptiste Remy, Manuel Moussallam, Michalis Vazirgiannis
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
 [155] arXiv:2002.03495 (replaced) [pdf, ps, other]

Title: A Diffusion Theory for Deep Learning Dynamics: Stochastic Gradient Descent Escapes From Sharp Minima Exponentially Fast
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [156] arXiv:2002.03575 (replaced) [pdf, other]

Title: Bilinear Graph Neural Network with Node Interactions
Subjects: Machine Learning (cs.LG); Graphics (cs.GR); Machine Learning (stat.ML)
 [157] arXiv:2002.03864 (replaced) [pdf, other]

Title: Deep Graph Mapper: Seeing Graphs through the Neural Lens
Comments: 13 pages, 10 figures
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
 [158] arXiv:2002.04014 (replaced) [pdf, other]

Title: Statistically Efficient Off-Policy Policy Gradients
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
 [159] arXiv:2002.04108 (replaced) [pdf, other]

Title: Adversarial Filters of Dataset Biases
Authors: Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, Yejin Choi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
 [160] arXiv:2002.04320 (replaced) [pdf, other]

Title: Self-concordant analysis of Frank-Wolfe algorithms
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Computation (stat.CO)
 [161] arXiv:2002.05648 (replaced) [pdf, ps, other]

Title: Politics of Adversarial Machine Learning
Comments: Authors ordered alphabetically; 4 pages
Subjects: Computers and Society (cs.CY); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [162] arXiv:2002.06117 (replaced) [pdf, ps, other]

Title: Local continuity of log-concave projection, with applications to estimation under model misspecification
Subjects: Statistics Theory (math.ST)
 [163] arXiv:2002.06715 (replaced) [pdf, other]

Title: BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning
Journal-ref: Eighth International Conference on Learning Representations (ICLR 2020)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [164] arXiv:2002.07916 (replaced) [pdf, other]

Title: Information Condensing Active Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)