Statistics

New submissions

New submissions for Fri, 10 Jul 20

[1]  arXiv:2007.04358 [pdf, other]
Title: Generalised Bayes Updates with $f$-divergences through Probabilistic Classifiers
Subjects: Methodology (stat.ME); Computation (stat.CO)

A stream of algorithmic advances has steadily increased the popularity of the Bayesian approach as an inference paradigm, both from the theoretical and applied perspective. Even with apparent successes in numerous application fields, a rising concern is the robustness of Bayesian inference in the presence of model misspecification, which may lead to undesirable extreme behavior of the posterior distributions for large sample sizes. Generalized belief updating with a loss function represents a central principle to making Bayesian inference more robust and less vulnerable to deviations from the assumed model. Here we consider such updates with $f$-divergences to quantify a discrepancy between the assumed statistical model and the probability distribution which generated the observed data. Since the latter is generally unknown, estimation of the divergence may be viewed as an intractable problem. We show that the divergence becomes accessible through the use of probabilistic classifiers that can leverage an estimate of the ratio of two probability distributions even when one or both of them is unknown. We demonstrate the behavior of generalized belief updates for various specific choices under the $f$-divergence family. We show that for specific divergence functions such an approach can even improve on methods evaluating the correct model likelihood function analytically.

[2]  arXiv:2007.04386 [pdf, other]
Title: Contour Models for Boundaries Enclosing Star-Shaped and Approximately Star-Shaped Polygons
Subjects: Methodology (stat.ME)

Boundaries on spatial fields divide regions with particular features from surrounding background areas. These boundaries are often described with contour lines. To measure and record these boundaries, contours are often represented as ordered sequences of spatial points that connect to form a line. Methods to identify boundary lines from interpolated spatial fields are well-established. Less attention has been paid to how to model sequences of connected spatial points. For data of the latter form, we introduce the Gaussian Star-shaped Contour Model (GSCM). GSCMs generate sequences of spatial points via generating sets of distances in various directions from a fixed starting point. The GSCM is designed for modeling contours that enclose regions that are star-shaped polygons or approximately star-shaped polygons. Metrics are introduced to assess the extent to which a polygon deviates from star-shaped. Simulation studies illustrate the performance of the GSCM in various scenarios and an analysis of Arctic sea ice edge contour data highlights how GSCMs can be applied to observational data.

[3]  arXiv:2007.04387 [pdf, other]
Title: Double spike Dirichlet priors for structured weighting
Authors: Huiming Lin, Meng Li
Subjects: Methodology (stat.ME)

Assigning weights to a large pool of objects is a fundamental task in a wide variety of applications. In this article, we introduce a concept of structured high-dimensional probability simplexes, in which most components are zero or near zero and the remaining ones are close to each other. Such structure is well motivated by 1) high-dimensional weights that are common in modern applications, and 2) ubiquitous examples in which equal weights---despite their simplicity---often achieve favorable or even state-of-the-art predictive performances. This particular structure, however, presents unique challenges both computationally and statistically. To address these challenges, we propose a new class of double spike Dirichlet priors to shrink a probability simplex to one with the desired structure. When applied to ensemble learning, such priors lead to a Bayesian method for structured high-dimensional ensembles that is useful for forecast combination and improving random forests, while enabling uncertainty quantification. We design efficient Markov chain Monte Carlo algorithms for easy implementation. Posterior contraction rates are established to provide theoretical support. We demonstrate the wide applicability and competitive performance of the proposed methods through simulations and two real data applications using the European Central Bank Survey of Professional Forecasters dataset and a UCI dataset.

[4]  arXiv:2007.04441 [pdf, other]
Title: Sparse Regression for Extreme Values
Comments: 3 figures
Subjects: Methodology (stat.ME)

We study the problem of selecting features associated with extreme values in high dimensional linear regression. Normally, in linear modeling problems, the presence of abnormal extreme values or outliers is considered an anomaly which should either be removed from the data or remedied using robust regression methods. In many situations, however, the extreme values in regression modeling are not outliers but rather the signals of interest; consider traces from spiking neurons, volatility in finance, or extreme events in climate science, for example. In this paper, we propose a new method for sparse high-dimensional linear regression for extreme values which is motivated by the Subbotin, or generalized normal distribution. This leads us to utilize an $\ell_p$ norm loss where $p$ is an even integer greater than two; we demonstrate that this loss increases the weight on extreme values. We prove consistency and variable selection consistency for the $\ell_p$ norm regression with a Lasso penalty, which we term the Extreme Lasso. Through simulation studies and real-world data examples, we show that this method outperforms other methods currently used in the literature for selecting features of interest associated with extreme values in high-dimensional regression.
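
The claim that an $\ell_p$ loss with even $p > 2$ puts more weight on extreme values can be checked numerically. The sketch below is an illustration only, not the authors' Extreme Lasso: it assumes the loss is the sum of $|r_i|^p$ over residuals, and the helper names `lp_loss` and `loss_share_of_largest` are hypothetical.

```python
def lp_loss(residuals, p):
    """Sum of |r|^p over the residuals (assumed form of the l_p loss)."""
    return sum(abs(r) ** p for r in residuals)


def loss_share_of_largest(residuals, p):
    """Fraction of the total l_p loss contributed by the largest residual."""
    largest = max(abs(r) for r in residuals)
    return largest ** p / lp_loss(residuals, p)


# Three small residuals and one extreme residual (made-up numbers).
residuals = [0.5, -0.5, 0.5, 3.0]
share_p2 = loss_share_of_largest(residuals, 2)
share_p4 = loss_share_of_largest(residuals, 4)
```

With these made-up residuals, the extreme residual accounts for a larger share of the loss at $p = 4$ than at $p = 2$, which is the weighting effect the abstract describes.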

[5]  arXiv:2007.04443 [pdf, other]
Title: Minimax Efficient Finite-Difference Stochastic Gradient Estimators Using Black-Box Function Evaluations
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)

We consider stochastic gradient estimation using noisy black-box function evaluations. A standard approach is to use the finite-difference method or its variants. While this approach is natural, it remains open, to our knowledge, whether its statistical accuracy is the best possible. This paper argues that it is, by showing that central finite-difference is a nearly minimax optimal zeroth-order gradient estimator, among both the class of linear estimators and the much larger class of all (nonlinear) estimators.
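
For readers unfamiliar with the estimator under study, a minimal sketch of a central finite-difference gradient estimate from noisy black-box evaluations follows. The step size `h`, the averaging over `n_repeats`, and the function names are assumptions made for illustration; the paper's tuning and analysis are not reproduced here.

```python
import random


def central_difference_gradient(f, x, h=1e-2, n_repeats=1):
    """Estimate the gradient of f at x by central finite differences.

    f is treated as a black box that may return noisy values; averaging
    n_repeats independent difference quotients reduces the noise.
    """
    grad = []
    for i in range(len(x)):
        total = 0.0
        for _ in range(n_repeats):
            x_plus = list(x)
            x_minus = list(x)
            x_plus[i] += h
            x_minus[i] -= h
            total += (f(x_plus) - f(x_minus)) / (2.0 * h)
        grad.append(total / n_repeats)
    return grad


def make_noisy_quadratic(noise_sd=0.01, seed=0):
    """Hypothetical black box: sum of squares plus Gaussian noise."""
    rng = random.Random(seed)

    def f(x):
        return sum(v * v for v in x) + rng.gauss(0.0, noise_sd)

    return f


# The true gradient of the sum of squares at (1, -2) is (2, -4).
estimate = central_difference_gradient(
    make_noisy_quadratic(), [1.0, -2.0], h=0.1, n_repeats=400
)
```

A smaller `h` reduces the bias of each difference quotient but amplifies the evaluation noise, which is the trade-off that makes the choice of step size matter.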

[6]  arXiv:2007.04445 [pdf, ps, other]
Title: Estimation and inference on high-dimensional individualized treatment rule in observational data using split-and-pooled de-correlated score
Comments: 10 pages, 6 figures, 2 tables
Subjects: Methodology (stat.ME)

With the increasing adoption of electronic health records, there is an increasing interest in developing individualized treatment rules (ITRs), which recommend treatments according to patients' characteristics, from large observational data. However, there is a lack of valid inference procedures for ITRs developed from this type of data in the presence of high-dimensional covariates. In this work, we develop a penalized doubly robust method to estimate the optimal ITRs from high-dimensional data. We propose a split-and-pooled de-correlated score to construct hypothesis tests and confidence intervals. Our proposal utilizes the data splitting to conquer the slow convergence rate of nuisance parameter estimations, such as non-parametric methods for outcome regression or propensity models. We establish the limiting distributions of the split-and-pooled de-correlated score test and the corresponding one-step estimator in high-dimensional setting. Simulation and real data analysis are conducted to demonstrate the superiority of the proposed method.

[7]  arXiv:2007.04446 [pdf, other]
Title: StructureBoost: Efficient Gradient Boosting for Structured Categorical Variables
Authors: Brian Lucena
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)

Gradient boosting methods based on Structured Categorical Decision Trees (SCDT) have been demonstrated to outperform numerical and one-hot-encodings on problems where the categorical variable has a known underlying structure. However, the enumeration procedure in the SCDT is infeasible except for categorical variables with low or moderate cardinality. We propose and implement two methods to overcome the computational obstacles and efficiently perform Gradient Boosting on complex structured categorical variables. The resulting package, called StructureBoost, is shown to outperform established packages such as CatBoost and LightGBM on problems with categorical predictors that contain sophisticated structure. Moreover, we demonstrate that StructureBoost can make accurate predictions on unseen categorical values due to its knowledge of the underlying structure.

[8]  arXiv:2007.04470 [pdf, other]
Title: Finite mixture models are typically inconsistent for the number of components
Comments: 16 pages, 1 figure
Subjects: Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)

Scientists and engineers are often interested in learning the number of subpopulations (or components) present in a data set. Practitioners commonly use a Dirichlet process mixture model (DPMM) for this purpose; in particular, they count the number of clusters---i.e. components containing at least one data point---in the DPMM posterior. But Miller and Harrison (2013) warn that the DPMM cluster-count posterior is severely inconsistent for the number of latent components when the data are truly generated from a finite mixture; that is, the cluster-count posterior probability on the true generating number of components goes to zero in the limit of infinite data. A potential alternative is to use a finite mixture model (FMM) with a prior on the number of components. Past work has shown the resulting FMM component-count posterior is consistent. But existing results crucially depend on the assumption that the component likelihoods are perfectly specified. In practice, this assumption is unrealistic, and empirical evidence (Miller and Dunson, 2019) suggests that the FMM posterior on the number of components is sensitive to the likelihood choice. In this paper, we add rigor to data-analysis folk wisdom by proving that under even the slightest model misspecification, the FMM posterior on the number of components is ultraseverely inconsistent: for any finite $k \in \mathbb{N}$, the posterior probability that the number of components is $k$ converges to 0 in the limit of infinite data. We illustrate practical consequences of our theory on simulated and real data sets.

[9]  arXiv:2007.04486 [pdf, other]
Title: Making learning more transparent using conformalized performance prediction
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

In this work, we study some novel applications of conformal inference techniques to the problem of providing machine learning procedures with more transparent, accurate, and practical performance guarantees. We provide a natural extension of the traditional conformal prediction framework, done in such a way that we can make valid and well-calibrated predictive statements about the future performance of arbitrary learning algorithms, when passed an as-yet unseen training set. In addition, we include some nascent empirical examples to illustrate potential applications.

[10]  arXiv:2007.04509 [pdf, other]
Title: Supervised Robust Profile Clustering
Comments: 30 pages, 3 figures, Supplementary materials (5 figures, 1 table)
Subjects: Applications (stat.AP)

In many studies, dimension reduction methods are used to profile participant characteristics. For example, nutrition epidemiologists often use latent class models to characterize dietary patterns. One challenge with such approaches is understanding subtle variations in patterns across subpopulations. Robust Profile Clustering (RPC) provides a dual flexible clustering model, where participants may cluster at two levels: (1) globally, where participants are clustered according to behaviors shared across an overall population, and (2) locally, where individual behaviors can deviate and cluster in subpopulations. We link clusters to a health outcome using a joint model. This model is used to derive dietary patterns in the United States and evaluate case proportion of orofacial clefts. Using dietary consumption data from the 1997-2009 National Birth Defects Prevention Study, a population-based case-control study, we determine how maternal dietary profiles are associated with an orofacial cleft among offspring. Results indicated that mothers who consumed a high proportion of fruits and vegetables compared to meats, such as chicken and beef, had lower odds of delivering a child with an orofacial cleft defect.

[11]  arXiv:2007.04511 [pdf, ps, other]
Title: Causal Effects in Twin Studies: the Role of Interference
Subjects: Methodology (stat.ME)

The use of twin designs to address causal questions is becoming increasingly popular. A standard assumption is that there is no interference between twins---that is, no twin's exposure has a causal impact on their co-twin's outcome. However, there may be settings in which this assumption would not hold, and this would (1) impact the causal interpretation of parameters obtained by commonly used existing methods; (2) change which effects are of greatest interest; and (3) impact the conditions under which we may estimate these effects. We explore these issues, and we derive semi-parametric efficient estimators for causal effects in the presence of interference between twins. Using data from the Minnesota Twin Family Study, we apply our estimators to assess whether twins' consumption of alcohol in early adolescence may have a causal impact on their co-twins' substance use later in life.

[12]  arXiv:2007.04518 [pdf, other]
Title: Robust Geodesic Regression
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

This paper studies robust regression for data on Riemannian manifolds. Geodesic regression is the generalization of linear regression to a setting with a manifold-valued dependent variable and one or more real-valued independent variables. The existing work on geodesic regression uses the sum-of-squared errors to find the solution, but as in the classical Euclidean case, the least-squares method is highly sensitive to outliers. In this paper, we use M-type estimators, including the $L_1$, Huber and Tukey biweight estimators, to perform robust geodesic regression, and describe how to calculate the tuning parameters for the latter two. We also show that, on compact symmetric spaces, all M-type estimators are maximum likelihood estimators, and argue for the overall superiority of the $L_1$ estimator over the $L_2$ and Huber estimators on high-dimensional manifolds and over the Tukey biweight estimator on compact high-dimensional manifolds. Results from numerical examples, including analysis of real neuroimaging data, demonstrate the promising empirical properties of the proposed approach.

[13]  arXiv:2007.04558 [pdf, other]
Title: Beyond Scalar Treatment: A Causal Analysis of Hippocampal Atrophy on Behavioral Deficits in Alzheimer's Studies
Subjects: Applications (stat.AP); Methodology (stat.ME)

Alzheimer's disease is a progressive form of dementia that results in problems with memory, thinking and behavior. It often starts with abnormal aggregation and deposition of beta-amyloid and tau, followed by neuronal damage such as atrophy of the hippocampi, and finally leads to behavioral deficits. Despite significant progress in finding biomarkers associated with behavioral deficits, the underlying causal mechanism remains largely unknown. Here we investigate whether and how hippocampal atrophy contributes to behavioral deficits based on a large-scale observational study conducted by the Alzheimer's Disease Neuroimaging Initiative (ADNI). As a key novelty, we use 2D representations of the hippocampi, which allows us to better understand atrophy associated with different subregions. It, however, introduces methodological challenges as existing causal inference methods are not well suited for exploiting structural information embedded in the 2D exposures. Moreover, our data contain more than 6 million clinical and genetic covariates, necessitating appropriate confounder selection methods. We hence develop a novel two-step causal inference approach tailored for our ADNI data application. Analysis results suggest that atrophy of CA1 and subiculum subregions may cause more severe behavioral deficits compared to CA2 and CA3 subregions. We further evaluate our method using simulations and provide theoretical guarantees.

[14]  arXiv:2007.04586 [pdf, other]
Title: $K$-Means and Gaussian Mixture Modeling with a Separation Constraint
Comments: 16 pages, 6 tables, 1 figure with 3 subfigures
Subjects: Computation (stat.CO)

We consider the problem of clustering with $K$-means and Gaussian mixture models with a constraint on the separation between the centers in the context of real-valued data. We first propose a dynamic programming approach to solving the $K$-means problem with a separation constraint on the centers, building on (Wang and Song, 2011). In the context of fitting a Gaussian mixture model, we then propose an EM algorithm that incorporates such a constraint. A separation constraint can help regularize the output of a clustering algorithm, and we provide both simulated and real data examples to illustrate this point.

[15]  arXiv:2007.04727 [pdf, other]
Title: Supplemental Studies for Simultaneous Goodness-of-Fit Testing
Authors: Wolfgang Rolke
Subjects: Applications (stat.AP); High Energy Physics - Experiment (hep-ex)

Testing to see whether a given data set comes from some specified distribution is among the oldest types of problems in Statistics. Many such tests have been developed and their performance studied. The general result has been that while a certain test might perform well, that is, have good power, in one situation, it will fail badly in others. This is not a surprise given the great many ways in which a distribution can differ from the one specified in the null hypothesis. It is therefore very difficult to decide a priori which test to use. The obvious solution is not to rely on any one test but to run several of them. This, however, leads to the problem of simultaneous inference: if several tests are done, then even if the null hypothesis is true, one of them is likely to reject it anyway just by random chance. In this paper we present a method that yields a p value that is uniform under the null hypothesis no matter how many tests are run. This is achieved by adjusting the p value via simulation. We present a number of simulation studies that show the uniformity of the p value and others that show that this test is superior to any one test if the power is averaged over a large number of cases.
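
One generic way to adjust a p value via simulation when several tests are run is to calibrate the smallest p value against its own simulated null distribution. The sketch below illustrates that idea only and is not claimed to be the author's procedure. It assumes a null hypothesis of standard normal data, and the two statistics and all function names are hypothetical.

```python
import random
import statistics


def compute_statistics(sample):
    """Two hypothetical statistics for H0: data ~ N(0, 1)."""
    mean_stat = abs(statistics.fmean(sample))
    var_stat = abs(statistics.pvariance(sample) - 1.0)
    return [mean_stat, var_stat]


def upper_tail_p(value, null_values):
    """Simulation p value: share of null statistics at least as large."""
    return sum(v >= value for v in null_values) / len(null_values)


def adjusted_min_p(sample, n_sim=500, seed=0):
    """Calibrate the smallest p value against its simulated null distribution."""
    rng = random.Random(seed)
    n = len(sample)
    # Statistics of n_sim data sets drawn under the null hypothesis.
    null_stats = [
        compute_statistics([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sim)
    ]
    columns = list(zip(*null_stats))  # one tuple of null values per statistic

    def min_p(stats):
        return min(upper_tail_p(s, col) for s, col in zip(stats, columns))

    observed = min_p(compute_statistics(sample))
    null_min_ps = [min_p(stats) for stats in null_stats]
    # Share of null data sets whose smallest p value is at least as extreme.
    return sum(m <= observed for m in null_min_ps) / n_sim


# Made-up sample that is clearly shifted away from N(0, 1).
shifted_sample = [3.0 + 0.1 * i for i in range(20)]
p_shifted = adjusted_min_p(shifted_sample)
```

Because the smallest p value is compared with the distribution of smallest p values under the null, adding more tests changes the reference distribution rather than inflating the rejection rate.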

[16]  arXiv:2007.04767 [pdf, other]
Title: Non-proportional hazards in immuno-oncology: is an old perspective needed?
Authors: Dominic Magirr
Subjects: Applications (stat.AP)

A fundamental concept in two-arm non-parametric survival analysis is the comparison of observed versus expected numbers of events on one of the treatment arms (the choice of which arm is arbitrary), where the expectation is taken assuming that the true survival curves in the two arms are identical. This concept is at the heart of the counting-process theory that provides a rigorous basis for methods such as the log-rank test. It is natural, therefore, to maintain this perspective when extending the log-rank test to deal with non-proportional hazards, for example by considering a weighted sum of the "observed - expected" terms, where larger weights are given to time periods where the hazard ratio is expected to favour the experimental treatment. In doing so, however, one may stumble across some rather subtle issues, related to the difficulty in ascribing a causal interpretation to hazard ratios, that may lead to strange conclusions. An alternative approach is to view non-parametric survival comparisons as permutation tests. With this perspective, one can easily improve on the efficiency of the log-rank test, whilst thoroughly controlling the false positive rate. In particular, for the field of immuno-oncology, where researchers often anticipate a delayed treatment effect, sample sizes could be substantially reduced without loss of power.

[17]  arXiv:2007.04791 [pdf, other]
Title: varTestnlme: Variance Components Testing in Linear and Nonlinear Mixed-effects Models
Subjects: Methodology (stat.ME); Computation (stat.CO)

The issue of variance components testing arises naturally when building mixed-effects models, to decide which effects should be modeled as fixed or random. While tests for fixed effects are available in R for models fitted with lme4, tools are missing when it comes to random effects. The varTestnlme package for R aims at filling this gap. It allows users to test whether any subset of the variances and covariances is equal to zero using likelihood ratio tests. It also offers the possibility to test simultaneously for fixed effects and variance components. It can be used for linear, generalized linear or nonlinear mixed-effects models fitted via lme4, nlme or saemix. Theoretical properties of the likelihood ratio test used are recalled, and examples based on different real datasets using different mixed models are provided.

[18]  arXiv:2007.04799 [pdf, other]
Title: Dissimilarity functions for rank-based hierarchical clustering of continuous variables
Comments: 36 pages, 10 figures, 7 tables
Subjects: Methodology (stat.ME)

We present a theoretical framework for a (copula-based) notion of dissimilarity between subsets of continuous random variables and study its main properties. Special attention is paid to those properties that are prone to the hierarchical agglomerative methods, such as reducibility. We hence provide insights for the use of such a measure in clustering algorithms, which allows us to cluster random variables according to the association/dependence among them, and present a simulation study. Real case studies illustrate the whole methodology.

[19]  arXiv:2007.04803 [pdf, other]
Title: Online Approximate Bayesian learning
Comments: 76 pages (including an Appendix of 43 pages), 3 figures, 2 tables
Subjects: Machine Learning (stat.ML); Statistics Theory (math.ST); Computation (stat.CO)

We introduce in this work a new approach for online approximate Bayesian learning. The main idea of the proposed method is to approximate the sequence $(\pi_t)_{t\geq 1}$ of posterior distributions by a sequence $(\tilde{\pi}_t)_{t\geq 1}$ which (i) can be estimated in an online fashion using sequential Monte Carlo methods and (ii) is shown to converge to the same distribution as the sequence $(\pi_t)_{t\geq 1}$, under weak assumptions on the statistical model at hand. In its simplest version, $(\tilde{\pi}_t)_{t\geq 1}$ is the sequence of filtering distributions associated to a particular state-space model, which can therefore be approximated using a standard particle filter algorithm. We illustrate on several challenging examples the benefits of this approach for approximate Bayesian parameter inference, and with one real data example we show that its online predictive performance can significantly outperform that of stochastic gradient descent and streaming variational Bayes.

[20]  arXiv:2007.04813 [pdf, other]
Title: Graph-Based Continual Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.

[21]  arXiv:2007.04951 [pdf, other]
Title: Adding experimental treatment arms to Multi-Arm Multi-Stage platform trials in progress
Comments: 22 pages, 2 figures
Subjects: Applications (stat.AP)

Multi-Arm Multi-Stage (MAMS) platform trials are an efficient tool for the comparison of several treatments with a control. Suppose a new treatment becomes available at some stage of a trial already in progress. There are clear benefits to adding the treatment to the current trial for comparison, but how?
As flexible as the MAMS framework is, it requires pre-planned options for how the trial proceeds at each stage in order to control the familywise error rate. Thus, as with many adaptive designs, it is difficult to make unplanned design modifications. The conditional error approach is a tool that allows unplanned design modifications while maintaining the overall error rate. In this work we use the conditional error approach to allow adding new arms to a MAMS trial in progress.
Using a single stage two-arm trial, we demonstrate the principles of incorporating additional hypotheses into the testing structure. With this framework for adding treatments and hypotheses in place, we show how to update the testing procedure for a MAMS trial in progress to incorporate additional treatment arms. Through simulation, we illustrate the operating characteristics of such procedures.

[22]  arXiv:2007.04956 [pdf, other]
Title: Bayesian Computation in Dynamic Latent Factor Models
Comments: 21 pages, 7 figures
Subjects: Methodology (stat.ME); Computation (stat.CO)

Bayesian computation for filtering and forecasting analysis is developed for a broad class of dynamic models. The ability to scale-up such analyses in non-Gaussian, nonlinear multivariate time series models is advanced through the introduction of a novel copula construction in sequential filtering of coupled sets of dynamic generalized linear models. The new copula approach is integrated into recently introduced multiscale models in which univariate time series are coupled via nonlinear forms involving dynamic latent factors representing cross-series relationships. The resulting methodology offers dramatic speed-up in online Bayesian computations for sequential filtering and forecasting in this broad, flexible class of multivariate models. Two examples in nonlinear models for very heterogeneous time series of non-negative counts demonstrate massive computational efficiencies relative to existing simulation-based methods, while defining similar filtering and forecasting outcomes.

Cross-lists for Fri, 10 Jul 20

[23]  arXiv:2007.04393 (cross-list from cs.LG) [pdf, other]
Title: Adaptive Regret for Control of Time-Varying Dynamics
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

We consider regret minimization for online control with time-varying linear dynamical systems. The metric of performance we study is adaptive policy regret, or regret compared to the best policy on any interval in time. We give an efficient algorithm that attains first-order adaptive regret guarantees for the setting of online convex optimization with memory. We also show that these first-order bounds are nearly tight.
This algorithm is then used to derive a controller with adaptive regret guarantees that provably competes with the best linear controller on any interval in time. We validate these theoretical findings experimentally on simulations of time-varying dynamics and disturbances.

[24]  arXiv:2007.04395 (cross-list from cs.LG) [pdf, other]
Title: Hierarchical Graph Matching Networks for Deep Graph Similarity Learning
Comments: 17 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

While the celebrated graph neural networks yield effective representations for individual nodes of a graph, there has been relatively less success in extending to deep graph similarity learning. Recent work has considered either global-level graph-graph interactions or low-level node-node interactions, ignoring the rich cross-level interactions (e.g., between nodes and a whole graph). In this paper, we propose a Hierarchical Graph Matching Network (HGMN) for computing the graph similarity between any pair of graph-structured objects. Our model jointly learns graph representations and a graph matching metric function for computing graph similarities in an end-to-end fashion. The proposed HGMN model consists of a node-graph matching network for effectively learning cross-level interactions between nodes of a graph and a whole graph, and a siamese graph neural network for learning global-level interactions between two graphs. Our comprehensive experiments demonstrate that HGMN consistently outperforms state-of-the-art graph matching network baselines for both classification and regression tasks.

[25]  arXiv:2007.04410 (cross-list from cs.SI) [pdf, other]
Title: Network Modelling of Criminal Collaborations with Dynamic Bayesian Steady Evolutions
Subjects: Social and Information Networks (cs.SI); Applications (stat.AP); Machine Learning (stat.ML)

The threat status and criminal collaborations of potential terrorists are hidden but give rise to observable behaviours and communications. Terrorists, when acting in concert, need to communicate to organise their plots. The authorities utilise such observable behaviour and communication data to inform their investigations and policing. We present a dynamic latent network model that integrates real-time communications data with prior knowledge on individuals. This model estimates and predicts the latent strength of criminal collaboration between individuals to assist in the identification of potential cells and the measurement of their threat levels. We demonstrate how, by assuming certain plausible conditional independences across the measurements associated with this population, the network model can be combined with models of individual suspects to provide fast transparent algorithms to predict group attacks. The methods are illustrated using a simulated example involving the threat posed by a cell suspected of plotting an attack.

[26]  arXiv:2007.04431 (cross-list from cs.LG) [pdf]
Title: Understanding the effect of hyperparameter optimization on machine learning models for structure design problems
Comments: 41 pages, 15 figures, 7 tables, under revision in Computer-Aided Design
Subjects: Machine Learning (cs.LG); Applications (stat.AP)

To relieve the computational cost of design evaluations using expensive finite element simulations, surrogate models have been widely applied in computer-aided engineering design. Machine learning algorithms (MLAs) have been implemented as surrogate models due to their capability of learning the complex interrelations between the design variables and the response from big datasets. Typically, an MLA regression model contains model parameters and hyperparameters. The model parameters are obtained by fitting the training data. Hyperparameters, which govern the model structures and the training processes, are assigned by users before training. There is a lack of systematic studies on the effect of hyperparameters on the accuracy and robustness of the surrogate model. In this work, we propose a hyperparameter optimization (HOpt) framework to deepen our understanding of the effect. Four frequently used MLAs, namely Gaussian Process Regression (GPR), Support Vector Machine (SVM), Random Forest Regression (RFR), and Artificial Neural Network (ANN), are tested on four benchmark examples. For each MLA model, the model accuracy and robustness before and after the HOpt are compared. The results show that HOpt can generally improve the performance of the MLA models. HOpt leads to few improvements in the MLAs' accuracy and robustness for complex problems, which feature a high-dimensional mixed-variable design space. HOpt is recommended for design problems of intermediate complexity. We also investigated the additional computational costs incurred by HOpt. The training cost is closely related to the MLA architecture. After HOpt, the training cost of ANN and RFR increases more than that of GPR and SVM. To sum up, this study benefits the selection of HOpt method for the different types of design problems based on their complexity.

[27]  arXiv:2007.04432 (cross-list from cs.LG) [pdf, other]
Title: Collapsing Bandits and Their Application to Public Health Interventions
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

We propose and study Collapsing Bandits, a new restless multi-armed bandit (RMAB) setting in which each arm follows a binary-state Markovian process with a special structure: when an arm is played, the state is fully observed, thus "collapsing" any uncertainty, but when an arm is passive, no observation is made, thus allowing uncertainty to evolve. The goal is to keep as many arms in the "good" state as possible by planning a limited budget of actions per round. Such Collapsing Bandits are natural models for many healthcare domains in which workers must simultaneously monitor patients and deliver interventions in a way that maximizes the health of their patient cohort. Our main contributions are as follows: (i) Building on the Whittle index technique for RMABs, we derive conditions under which the Collapsing Bandits problem is indexable. Our derivation hinges on novel conditions that characterize when the optimal policies may take the form of either "forward" or "reverse" threshold policies. (ii) We exploit the optimality of threshold policies to build fast algorithms for computing the Whittle index, including a closed form. (iii) We evaluate our algorithm on several data distributions including data from a real-world healthcare task in which a worker must monitor and deliver interventions to maximize their patients' adherence to tuberculosis medication. Our algorithm achieves a 3-order-of-magnitude speedup compared to state-of-the-art RMAB techniques while achieving similar performance.
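The "collapsing" structure described in this abstract can be illustrated with a small belief-tracking sketch. This is only an illustration under stated assumptions, not the authors' algorithm: it assumes a single two-state chain whose transition probabilities are known, and the names `p_stay_good` and `p_recover` are hypothetical. It does not compute a Whittle index.

```python
def collapse(observed_good):
    """Playing the arm reveals its state, so the belief becomes 0 or 1."""
    return 1.0 if observed_good else 0.0


def next_belief(belief, p_stay_good, p_recover):
    """One passive round: no observation, the belief drifts under the chain.

    belief      -- current probability that the arm is in the "good" state
    p_stay_good -- assumed P(good -> good)
    p_recover   -- assumed P(bad -> good)
    """
    return belief * p_stay_good + (1.0 - belief) * p_recover


def belief_after_passive_rounds(observed_good, rounds, p_stay_good, p_recover):
    """Belief in the "good" state after an observation and `rounds` passive rounds."""
    belief = collapse(observed_good)
    for _ in range(rounds):
        belief = next_belief(belief, p_stay_good, p_recover)
    return belief
```

Under these assumptions the belief of a passive arm moves from 0 or 1 toward the stationary probability of the chain, which is the sense in which uncertainty evolves while the arm is not played.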

[28]  arXiv:2007.04439 (cross-list from cs.LG) [pdf, other]
Title: Combining Differentiable PDE Solvers and Graph Neural Networks for Fluid Flow Prediction
Comments: ICML 2020
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)

Solving large complex partial differential equations (PDEs), such as those that arise in computational fluid dynamics (CFD), is a computationally expensive process. This has motivated the use of deep learning approaches to approximate the PDE solutions, yet the simulation results predicted from these approaches typically do not generalize well to truly novel scenarios. In this work, we develop a hybrid (graph) neural network that combines a traditional graph convolutional network with an embedded differentiable fluid dynamics simulator inside the network itself. By combining an actual CFD simulator (run on a much coarser resolution representation of the problem) with the graph network, we show that we can both generalize well to new situations and benefit from the substantial speedup of neural network CFD predictions, while also substantially outperforming the coarse CFD simulation alone.

[29]  arXiv:2007.04440 (cross-list from cs.LG) [pdf, other]
Title: On the relationship between class selectivity, dimensionality, and robustness
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

While the relative trade-offs between sparse and distributed representations in deep neural networks (DNNs) are well-studied, less is known about how these trade-offs apply to representations of semantically-meaningful information. Class selectivity, the variability of a unit's responses across data classes or dimensions, is one way of quantifying the sparsity of semantic representations. Given recent evidence showing that class selectivity can impair generalization, we sought to investigate whether it also confers robustness (or vulnerability) to perturbations of input data. We found that mean class selectivity predicts vulnerability to naturalistic corruptions; networks regularized to have lower levels of class selectivity are more robust to corruption, while networks with higher class selectivity are more vulnerable to corruption, as measured using Tiny ImageNetC and CIFAR10C. In contrast, we found that class selectivity increases robustness to multiple types of gradient-based adversarial attacks. To examine this difference, we studied the dimensionality of the change in the representation due to perturbation, finding that decreasing class selectivity increases the dimensionality of this change for both corruption types, but with a notably larger increase for adversarial attacks. These results demonstrate the causal relationship between selectivity and robustness and provide new insights into the mechanisms of this relationship.

[30]  arXiv:2007.04451 (cross-list from cs.LG) [pdf, ps, other]
Title: Online probabilistic label trees
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We introduce online probabilistic label trees (OPLTs), an algorithm that trains a label tree classifier in a fully online manner, without any prior knowledge about the number of training instances, their features and labels. OPLTs are characterized by low time and space complexity as well as strong theoretical guarantees. They can be used for online multi-label and multi-class classification, including the very challenging scenarios of one- or few-shot learning. We demonstrate the attractiveness of OPLTs in a wide empirical study on several instances of the tasks mentioned above.

[31]  arXiv:2007.04458 (cross-list from cs.LG) [pdf, other]
Title: Robust Bayesian Classification Using an Optimistic Score Ratio
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We build a Bayesian contextual classification model using an optimistic score ratio for robust binary classification when there is limited information on the class-conditional, or contextual, distribution. The optimistic score searches for the distribution that is most plausible to explain the observed outcomes in the testing sample among all distributions belonging to the contextual ambiguity set which is prescribed using a limited structural constraint on the mean vector and the covariance matrix of the underlying contextual distribution. We show that the Bayesian classifier using the optimistic score ratio is conceptually attractive, delivers solid statistical guarantees and is computationally tractable. We showcase the power of the proposed optimistic score ratio classifier on both synthetic and empirical data.

[32]  arXiv:2007.04459 (cross-list from cs.LG) [pdf, other]
Title: Meta-Learning One-Class Classification with DeepSets: Application in the Milky Way
Subjects: Machine Learning (cs.LG); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (stat.ML)

We explore in this paper the use of neural networks designed for point-clouds and sets on a new meta-learning task. We present experiments on the astronomical challenge of characterizing the stellar population of stellar streams. Stellar streams are elongated structures of stars in the outskirts of the Milky Way that form when a (small) galaxy breaks up under the Milky Way's gravitational force. We consider that we obtain, for each stream, a small 'support set' of stars that belongs to this stream. We aim to predict if the other stars in that region of the sky are from that stream or not, similar to one-class classification. Each "stream task" could also be transformed into a binary classification problem in a highly imbalanced regime (or supervised anomaly detection) by using the much bigger set of "other" stars and considering them as noisy negative examples. We propose to study the problem in the meta-learning regime: we expect that we can learn general information on characterizing a stream's stellar population by meta-learning across several streams in a fully supervised regime, and transfer it to new streams using only positive supervision. We present a novel use of Deep Sets, a model developed for point-cloud and sets, trained in a meta-learning fully supervised regime, and evaluated in a one-class classification setting. We compare it against Random Forests (with and without self-labeling) in the classic setting of binary classification, retrained for each task. We show that our method outperforms the Random-Forests even though the Deep Sets is not retrained on the new tasks, and accesses only a small part of the data compared to the Random Forest. We also show that the model performs well on a real-life stream when including additional fine-tuning.

[33]  arXiv:2007.04462 (cross-list from cs.LG) [pdf, other]
Title: Scalable Computations of Wasserstein Barycenter via Input Convex Neural Networks
Comments: 16 pages, 12 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

The Wasserstein Barycenter is a principled approach to represent the weighted mean of a given set of probability distributions, utilizing the geometry induced by optimal transport. In this work, we present a novel scalable algorithm to approximate Wasserstein Barycenters, aiming at high-dimensional applications in machine learning. Our proposed algorithm is based on the Kantorovich dual formulation of the 2-Wasserstein distance as well as a recent neural network architecture, the input convex neural network, that is known to parametrize convex functions. The distinguishing features of our method are: i) it only requires samples from the marginal distributions; ii) unlike the existing semi-discrete approaches, it represents the Barycenter with a generative model; iii) it allows computing the barycenter with arbitrary weights after one training session. We demonstrate the efficacy of our algorithm by comparing it with the state-of-the-art methods in multiple experiments.

[34]  arXiv:2007.04466 (cross-list from cs.LG) [pdf, ps, other]
Title: URSABench: Comprehensive Benchmarking of Approximate Bayesian Inference Methods for Deep Neural Networks
Comments: Presented at the ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

While deep learning methods continue to improve in predictive accuracy on a wide range of application domains, significant issues remain with other aspects of their performance including their ability to quantify uncertainty and their robustness. Recent advances in approximate Bayesian inference hold significant promise for addressing these concerns, but the computational scalability of these methods can be problematic when applied to large-scale models. In this paper, we describe initial work on the development of URSABench (the Uncertainty, Robustness, Scalability, and Accuracy Benchmark), an open-source suite of benchmarking tools for comprehensive assessment of approximate Bayesian inference methods with a focus on deep learning-based classification tasks.

[35]  arXiv:2007.04472 (cross-list from cs.LG) [pdf, other]
Title: Evaluation of Adversarial Training on Different Types of Neural Networks in Deep Learning-based IDSs
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Network security applications, including intrusion detection systems (IDSs) based on deep neural networks (DNNs), are increasing rapidly to make the detection of anomalous activities more accurate and robust. With the rapid increase in the use of DNNs and in the volume of data traveling through systems, the growing variety of adversarial attacks designed to defeat them creates a severe challenge. In this paper, we focus on investigating the effectiveness of different evasion attacks and how to train a resilient deep learning-based IDS using different neural networks, e.g., convolutional neural networks (CNN) and recurrent neural networks (RNN). We use the min-max approach to formulate the problem of training robust IDS against adversarial examples using two benchmark datasets. Our experiments on different deep learning algorithms and different benchmark datasets demonstrate that defense using an adversarial training-based min-max approach improves the robustness against the five well-known adversarial attack methods.

[36]  arXiv:2007.04480 (cross-list from eess.IV) [pdf, other]
Title: Automatic Probe Movement Guidance for Freehand Obstetric Ultrasound
Comments: Accepted at the 23rd International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2020)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

We present the first system that provides real-time probe movement guidance for acquiring standard planes in routine freehand obstetric ultrasound scanning. Such a system can contribute to the worldwide deployment of obstetric ultrasound scanning by lowering the required level of operator expertise. The system employs an artificial neural network that receives the ultrasound video signal and the motion signal of an inertial measurement unit (IMU) that is attached to the probe, and predicts a guidance signal. The network termed US-GuideNet predicts either the movement towards the standard plane position (goal prediction), or the next movement that an expert sonographer would perform (action prediction). While existing models for other ultrasound applications are trained with simulations or phantoms, we train our model with real-world ultrasound video and probe motion data from 464 routine clinical scans by 17 accredited sonographers. Evaluations for 3 standard plane types show that the model provides a useful guidance signal with an accuracy of 88.8% for goal prediction and 90.9% for action prediction.

[37]  arXiv:2007.04484 (cross-list from cs.LG) [pdf, other]
Title: Transparency Tools for Fairness in AI (Luskin)
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)

We propose new tools for policy-makers to use when assessing and correcting fairness and bias in AI algorithms. The three tools are:
- A new definition of fairness called "controlled fairness" with respect to choices of protected features and filters. The definition provides a simple test of fairness of an algorithm with respect to a dataset. This notion of fairness is suitable in cases where fairness is prioritized over accuracy, such as in cases where there is no "ground truth" data, only data labeled with past decisions (which may have been biased).
- Algorithms for retraining a given classifier to achieve "controlled fairness" with respect to a choice of features and filters. Two algorithms are presented, implemented and tested. These algorithms require training two different models in two stages. We experiment with combinations of various types of models for the first and second stage and report on which combinations perform best in terms of fairness and accuracy.
- Algorithms for adjusting model parameters to achieve a notion of fairness called "classification parity". This notion of fairness is suitable in cases where accuracy is prioritized. Two algorithms are presented, one which assumes that protected features are accessible to the model during testing, and one which assumes protected features are not accessible during testing.
We evaluate our tools on three different publicly available datasets. We find that the tools are useful for understanding various dimensions of bias, and that in practice the algorithms are effective in starkly reducing a given observed bias when tested on new data.

[38]  arXiv:2007.04504 (cross-list from cs.LG) [pdf, other]
Title: Learning Differential Equations that are Easy to Solve
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Differential equations parameterized by neural networks become expensive to solve numerically as training progresses. We propose a remedy that encourages learned dynamics to be easier to solve. Specifically, we introduce a differentiable surrogate for the time cost of standard numerical solvers, using higher-order derivatives of solution trajectories. These derivatives are efficient to compute with Taylor-mode automatic differentiation. Optimizing this additional objective trades model performance against the time cost of solving the learned dynamics. We demonstrate our approach by training substantially faster, while nearly as accurate, models in supervised classification, density estimation, and time-series modelling tasks.

[39]  arXiv:2007.04528 (cross-list from math.OC) [pdf, ps, other]
Title: Higher-order methods for convex-concave min-max optimization and monotone variational inequalities
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

We provide improved convergence rates for constrained convex-concave min-max problems and monotone variational inequalities with higher-order smoothness. In min-max settings where the $p^{th}$-order derivatives are Lipschitz continuous, we give an algorithm HigherOrderMirrorProx that achieves an iteration complexity of $O(1/T^{\frac{p+1}{2}})$ when given access to an oracle for finding a fixed point of a $p^{th}$-order equation. We give analogous rates for the weak monotone variational inequality problem. For $p>2$, our results improve upon the iteration complexity of the first-order Mirror Prox method of Nemirovski [2004] and the second-order method of Monteiro and Svaiter [2012]. We further instantiate our entire algorithm in the unconstrained $p=2$ case.
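As a reading aid, the stated rate can be specialized to small values of $p$. The block below is a sketch that assumes the bound is an error guarantee after $T$ iterations with an unspecified constant $C$ (the symbols $C$ and $\varepsilon$ are not in the abstract); it only restates the abstract's exponent and inverts it.

```latex
\[
  \text{error after } T \text{ iterations} \;\le\; \frac{C}{T^{\frac{p+1}{2}}}
  \quad\Longrightarrow\quad
  \begin{cases}
    p = 1: & O(1/T) \\
    p = 2: & O(1/T^{3/2}) \\
    p = 3: & O(1/T^{2})
  \end{cases}
\]
\[
  \frac{C}{T^{\frac{p+1}{2}}} \le \varepsilon
  \quad\Longleftrightarrow\quad
  T \;\ge\; \left(\frac{C}{\varepsilon}\right)^{\frac{2}{p+1}} .
\]
```

The $p=1$ and $p=2$ exponents match the first-order and second-order rates that the abstract compares against, which is why the improvement is claimed for $p>2$.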

[40]  arXiv:2007.04532 (cross-list from cs.LG) [pdf, other]
Title: A Study of Gradient Variance in Deep Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

The impact of gradient noise on training deep models is widely acknowledged but not well understood. In this context, we study the distribution of gradients during training. We introduce a method, Gradient Clustering, to minimize the variance of average mini-batch gradient with stratified sampling. We prove that the variance of average mini-batch gradient is minimized if the elements are sampled from a weighted clustering in the gradient space. We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training, and smaller learning rates coincide with higher variance. In addition, we introduce normalized gradient variance as a statistic that better correlates with the speed of convergence compared to gradient variance.

[41]  arXiv:2007.04540 (cross-list from cs.SI) [pdf, other]
Title: Contrastive Multiple Correspondence Analysis (cMCA): Applying the Contrastive Learning Method to Identify Political Subgroups
Comments: Both authors contributed equally to the paper and listed alphabetically. This manuscript is currently under review
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Ideal point estimation and dimensionality reduction have long been utilized to simplify and cluster complex, high-dimensional political data (e.g., roll-call votes and surveys) for use in analysis and visualization. These methods often work by finding the directions or principal components (PCs) on which either the data varies the most or respondents make the fewest decision errors. However, these PCs, which usually reflect the left-right political spectrum, are sometimes uninformative in explaining significant differences in the distribution of the data (e.g., how to categorize a set of highly-moderate voters). To tackle this issue, we adopt an emerging analysis approach, called contrastive learning. Contrastive learning, e.g., contrastive principal component analysis (cPCA), works by first splitting the data by predefined groups, and then deriving PCs on which the target group varies the most but the background group varies the least. As a result, cPCA can often find 'hidden' patterns, such as subgroups within the target group, which PCA cannot reveal when some variables are the dominant source of variations across the groups. We contribute to the field of contrastive learning by extending it to multiple correspondence analysis (MCA) to enable an analysis of data often encountered by social scientists, namely binary, ordinal, and nominal variables. We demonstrate the utility of contrastive MCA (cMCA) by analyzing three different surveys: The 2015 Cooperative Congressional Election Study, 2012 UTokyo-Asahi Elite Survey, and 2018 European Social Survey. Our results suggest that, first, for the cases when ordinary MCA depicts differences between groups, cMCA can further identify characteristics that divide the target group; second, for the cases when MCA does not show clear differences, cMCA can successfully identify meaningful directions and subgroups, which traditional methods overlook.

[42]  arXiv:2007.04546 (cross-list from cs.LG) [pdf, other]
Title: Wandering Within a World: Online Contextualized Few-Shot Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

We aim to bridge the gap between typical human and machine-learning environments by extending the standard framework of few-shot learning to an online, continual setting. In this setting, episodes do not have separate training and testing phases, and instead models are evaluated online while learning novel classes. As in the real world, where the presence of spatiotemporal context helps us retrieve skills learned in the past, our online few-shot learning setting also features an underlying context that changes throughout time. Object classes are correlated within a context and inferring the correct context can lead to better performance. Building upon this setting, we propose a new few-shot learning dataset based on large scale indoor imagery that mimics the visual experience of an agent wandering within a world. Furthermore, we convert popular few-shot learning approaches into online versions and we also propose a new model named contextual prototypical memory that can make use of spatiotemporal contextual information from the recent past.

[43]  arXiv:2007.04547 (cross-list from math.PR) [pdf, ps, other]
Title: On Optimal Uniform Concentration Inequalities for Discrete Entropy in the High-dimensional Setting
Authors: Yunpeng Zhao
Subjects: Probability (math.PR); Information Theory (cs.IT); Statistics Theory (math.ST)

We prove an exponential decay concentration inequality to bound the tail probability of the difference between the log-likelihood of discrete random variables and the negative entropy. The concentration bound we derive holds uniformly over all parameter values. The new result improves the convergence rate in an earlier work \cite{zhao2020note}, from $(K^2\log K)/n=o(1)$ to $(\log K)^2/n=o(1)$, where $n$ is the sample size and $K$ is the number of possible values of the discrete variable. We further prove that the rate $(\log K)^2/n=o(1)$ is optimal. The results are extended to misspecified log-likelihoods for grouped random variables.
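The quantity being bounded can be illustrated numerically. The sketch below is not taken from the paper: it assumes the log-likelihood is normalized by the sample size $n$ and evaluated at the true probability vector, and the function names are hypothetical. It simply measures the gap between the average log-likelihood and the negative entropy for a simulated sample.

```python
import math
import random


def negative_entropy(p):
    """Negative Shannon entropy: sum_k p_k * log(p_k), skipping zero entries."""
    return sum(q * math.log(q) for q in p if q > 0)


def average_log_likelihood(sample, p):
    """(1/n) * sum_i log p[x_i] for a sample of category indices."""
    return sum(math.log(p[x]) for x in sample) / len(sample)


def deviation(p, n, seed=0):
    """Absolute gap between the average log-likelihood and the negative entropy.

    Draws n i.i.d. categories from p with a seeded generator so the
    result is reproducible.
    """
    rng = random.Random(seed)
    sample = rng.choices(range(len(p)), weights=p, k=n)
    return abs(average_log_likelihood(sample, p) - negative_entropy(p))
```

For a uniform distribution every observation has the same log-probability, so the gap is zero up to rounding; for non-uniform distributions it shrinks as $n$ grows, and the abstract's result concerns how fast $n$ must grow relative to $K$.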

[44]  arXiv:2007.04553 (cross-list from econ.EM) [pdf, ps, other]
Title: Time Series Analysis of COVID-19 Infection Curve: A Change-Point Perspective
Subjects: Econometrics (econ.EM); Physics and Society (physics.soc-ph); Applications (stat.AP)

In this paper, we model the trajectory of the cumulative confirmed cases and deaths of COVID-19 (in log scale) via a piecewise linear trend model. The model naturally captures the phase transitions of the epidemic growth rate via change-points and further enjoys great interpretability due to its semiparametric nature. On the methodological front, we advance the nascent self-normalization (SN) technique (Shao, 2010) to testing and estimation of a single change-point in the linear trend of a nonstationary time series. We further combine the SN-based change-point test with the NOT algorithm (Baranowski et al., 2019) to achieve multiple change-point estimation. Using the proposed method, we analyze the trajectory of the cumulative COVID-19 cases and deaths for 30 major countries and discover interesting patterns with potentially relevant implications for effectiveness of the pandemic responses by different countries. Furthermore, based on the change-point detection algorithm and a flexible extrapolation function, we design a simple two-stage forecasting scheme for COVID-19 and demonstrate its promising performance in predicting cumulative deaths in the U.S.

[45]  arXiv:2007.04568 (cross-list from cs.LG) [pdf, ps, other]
Title: Learning to Bid Optimally and Efficiently in Adversarial First-price Auctions
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Information Theory (cs.IT); Machine Learning (stat.ML)

First-price auctions have very recently swept the online advertising industry, replacing second-price auctions as the predominant auction mechanism on many platforms. This shift has brought forth important challenges for a bidder: how should one bid in a first-price auction, where, unlike in second-price auctions, it is no longer optimal to bid one's private value truthfully and it is hard to know the others' bidding behaviors? In this paper, we take an online learning angle and address the fundamental problem of learning to bid in repeated first-price auctions, where both the bidder's private valuations and other bidders' bids can be arbitrary. We develop the first minimax optimal online bidding algorithm that achieves an $\widetilde{O}(\sqrt{T})$ regret when competing with the set of all Lipschitz bidding policies, a strong oracle that contains a rich set of bidding strategies. This novel algorithm is built on the insight that the presence of a good expert can be leveraged to improve performance, as well as an original hierarchical expert-chaining structure, both of which could be of independent interest in online learning. Further, by exploiting the product structure that exists in the problem, we modify this algorithm (in its vanilla form statistically optimal but computationally infeasible) into a computationally efficient and space-efficient algorithm that also retains the same $\widetilde{O}(\sqrt{T})$ minimax optimal regret guarantee. Additionally, through an impossibility result, we highlight that one is unlikely to compete this favorably with a stronger oracle (than the considered Lipschitz bidding policies). Finally, we test our algorithm on three real-world first-price auction datasets obtained from Verizon Media and demonstrate our algorithm's superior performance compared to several existing bidding algorithms.

[46]  arXiv:2007.04583 (cross-list from cs.LG) [pdf, other]
Title: Graph Convolutional Networks for Graphs Containing Missing Features
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)

The Graph Convolutional Network (GCN) has experienced great success in graph analysis tasks. It works by smoothing the node features across the graph. Current GCN models overwhelmingly assume that node feature information is complete. However, real-world graph data are often incomplete and contain missing features. Traditionally, people have to estimate and fill in the unknown features based on imputation techniques and then apply GCN. However, the processes of feature filling and graph learning are separated, resulting in degraded and unstable performance. This problem becomes more serious when a large number of features are missing. We propose an approach that adapts GCN to graphs containing missing features. In contrast to the traditional strategy, our approach integrates the processing of missing features and graph learning within the same neural network architecture. Our idea is to represent the missing data by a Gaussian Mixture Model (GMM) and calculate the expected activation of neurons in the first hidden layer of GCN, while keeping the other layers of the network unchanged. This enables us to learn the GMM parameters and network weight parameters in an end-to-end manner. Notably, our approach does not increase the computational complexity of GCN and it is consistent with GCN when the features are complete. We conduct experiments on the node label classification task and demonstrate that our approach significantly outperforms the best imputation-based methods by up to 99.43%, 102.96%, 6.97%, 35.36% in four benchmark graphs when a large portion of features are missing. The performance of our approach for the case with a low level of missing features is even superior to GCN for the case with complete features.
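The idea of an "expected activation" under a Gaussian mixture can be sketched for a single scalar pre-activation. This is a minimal illustration, not the paper's layer: it assumes a ReLU activation and a one-dimensional mixture, and all function names are hypothetical. It uses the standard closed form $E[\max(0, z)] = \mu\,\Phi(\mu/\sigma) + \sigma\,\varphi(\mu/\sigma)$ for $z \sim N(\mu, \sigma^2)$.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_relu_gaussian(mean, std):
    """E[max(0, z)] for z ~ N(mean, std^2); std == 0 means a fully observed value."""
    if std == 0.0:
        return max(mean, 0.0)
    ratio = mean / std
    return mean * normal_cdf(ratio) + std * normal_pdf(ratio)


def expected_relu_mixture(weights, means, stds):
    """Expected ReLU output when the input follows a 1-D Gaussian mixture."""
    return sum(
        w * expected_relu_gaussian(m, s)
        for w, m, s in zip(weights, means, stds)
    )
```

With zero variance the expression reduces to the ordinary ReLU, which mirrors the abstract's remark that the approach is consistent with GCN when the features are complete.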

[47]  arXiv:2007.04589 (cross-list from cs.LG) [pdf, other]
Title: InfoMax-GAN: Improved Adversarial Image Generation via Information Maximization and Contrastive Learning
Comments: Initial version was presented at NeurIPS 2019 Workshop on Information Theory and Machine Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

While Generative Adversarial Networks (GANs) are fundamental to many generative modelling applications, they suffer from numerous issues. In this work, we propose a principled framework to simultaneously address two fundamental issues in GANs: catastrophic forgetting of the discriminator and mode collapse of the generator. We achieve this by employing for GANs a contrastive learning and mutual information maximization approach, and perform extensive analyses to understand sources of improvements. Our approach significantly stabilises GAN training and improves GAN performance for image synthesis across five datasets under the same training and evaluation conditions against state-of-the-art works. Our approach is simple to implement and practical: it involves only one objective, is computationally inexpensive, and is robust across a wide range of hyperparameters without any tuning. For reproducibility, our code is available at https://github.com/kwotsin/mimicry.

[48]  arXiv:2007.04596 (cross-list from cs.LG) [pdf, ps, other]
Title: Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK
Comments: Conference on Learning Theory (COLT) 2020
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomial samples. On the other hand, we prove that any kernel method, including Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $\Omega(1 / d)$.
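The target function in this abstract, $f^{\star}(x) = a^{\top}|W^{\star}x|$, is itself expressible as a two-layer ReLU network because $|z| = \mathrm{ReLU}(z) + \mathrm{ReLU}(-z)$. The sketch below only illustrates that identity on plain Python lists; the function names are hypothetical and no training is performed.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(z):
    return max(z, 0.0)


def target(x, a, W):
    """f*(x) = a^T |W x|, with the absolute value applied coordinate-wise."""
    return sum(a_i * abs(z) for a_i, z in zip(a, matvec(W, x)))


def target_as_relu_network(x, a, W):
    """Same function written with 2d ReLU units: rows of W and rows of -W."""
    return sum(a_i * (relu(z) + relu(-z)) for a_i, z in zip(a, matvec(W, x)))
```

For example, with the rotation `W = [[0.0, -1.0], [1.0, 0.0]]`, `a = [1.0, 2.0]`, and `x = [3.0, -4.0]`, both functions evaluate $1\cdot|4| + 2\cdot|3|$.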

[49]  arXiv:2007.04604 (cross-list from cs.HC) [pdf]
Title: Building an Automated Gesture Imitation Game for Teenagers with ASD
Authors: Linda Nanan Vallée (ESATIC), Christophe Lohr, Sao Mai Nguyen (IMT Atlantique), Ioannis Kanellos (IMT Atlantique - INFO), O. Asseu (ESATIC)
Journal-ref: Far East Journal of Electronics and Communications, 2019, 22, pp.19 - 28
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Autism spectrum disorder is a neurodevelopmental condition that includes issues with communication and social interactions. People with ASD also often have restricted interests and repetitive behaviors. In this paper we build preliminary bricks of an automated gesture imitation game that will aim at improving social interactions with teenagers with ASD. The structure of the game is presented, as well as support tools and methods for skeleton detection and imitation learning. The game shall later be implemented using an interactive robot.

[50]  arXiv:2007.04612 (cross-list from cs.LG) [pdf, other]
Title: Concept Bottleneck Models
Comments: ICML 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We seek to learn models that we can interact with using high-level concepts: if the model did not think there was a bone spur in the x-ray, would it still predict severe arthritis? State-of-the-art models today do not typically support the manipulation of concepts like "the existence of bone spurs", as they are trained end-to-end to go directly from raw input (e.g., pixels) to output (e.g., arthritis severity). We revisit the classic idea of first predicting concepts that are provided at training time, and then using these concepts to predict the label. By construction, we can intervene on these \emph{concept bottleneck models} by editing their predicted concept values and propagating these changes to the final prediction. On x-ray grading and bird identification, concept bottleneck models achieve competitive accuracy with standard end-to-end models, while enabling interpretation in terms of high-level clinical concepts ("bone spurs") or bird attributes ("wing color"). These models also allow for richer human-model interaction: accuracy improves significantly if we can correct model mistakes on concepts at test time.
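
The two-stage structure described in this abstract (input to concepts, concepts to label, with test-time edits of the predicted concepts) can be sketched as follows. This is a minimal toy under stated assumptions: the logistic units, the weights, and the function names `predict_concepts`, `predict_label`, and `intervene` are hypothetical and do not reproduce the authors' architecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_concepts(x, concept_weights):
    """Stage 1: map the raw input x to concept probabilities (one logistic unit per concept)."""
    return [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in concept_weights]

def predict_label(concepts, label_weights, bias=0.0):
    """Stage 2: predict the label from the concept values only."""
    return sigmoid(sum(w * c for w, c in zip(label_weights, concepts)) + bias)

def intervene(concepts, corrections):
    """Return a copy of the concepts with selected entries overwritten by supplied values."""
    edited = list(concepts)
    for idx, value in corrections.items():
        edited[idx] = value
    return edited

# Hypothetical toy parameters: two input features, two concepts, one binary label.
concept_weights = [[2.0, -1.0], [-1.5, 0.5]]
label_weights = [3.0, 1.0]
bias = -2.0

x = [1.0, 0.5]
concepts = predict_concepts(x, concept_weights)
before = predict_label(concepts, label_weights, bias)
# Intervention: a human states that concept 0 is absent; the change propagates to the label.
after = predict_label(intervene(concepts, {0: 0.0}), label_weights, bias)
```

Since the label depends on the input only through the concept vector, editing a concept is enough to change the final prediction; in this toy example, setting concept 0 to zero lowers the predicted label probability.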

[51]  arXiv:2007.04618 (cross-list from cs.LG) [pdf, other]
Title: Federated Learning of User Authentication Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Machine learning-based User Authentication (UA) models have been widely deployed in smart devices. UA models are trained to map input data of different users to highly separable embedding vectors, which are then used to accept or reject new inputs at test time. Training UA models requires having direct access to the raw inputs and embedding vectors of users, both of which are privacy-sensitive information. In this paper, we propose Federated User Authentication (FedUA), a framework for privacy-preserving training of UA models. FedUA adopts the federated learning framework to enable a group of users to jointly train a model without sharing the raw inputs. It also allows users to generate their embeddings as random binary vectors, so that, unlike the existing approach in which the server constructs the spread-out embeddings, the embedding vectors are kept private as well. We show that our method is privacy-preserving, scales with the number of users, and allows new users to be added to training without changing the output layer. Our experimental results on the VoxCeleb dataset for speaker verification show that our method reliably rejects data of unseen users at very high true positive rates.

[52]  arXiv:2007.04630 (cross-list from cs.LG) [pdf, other]
Title: Maximum-and-Concatenation Networks
Comments: Accepted by ICML2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

While successful in many fields, deep neural networks (DNNs) still suffer from some open problems such as bad local minima and unsatisfactory generalization performance. In this work, we propose a novel architecture called Maximum-and-Concatenation Networks (MCN) that aims to eliminate bad local minima and to improve generalization ability. Remarkably, we prove that MCN has a very nice property; that is, \emph{every local minimum of an $(l+1)$-layer MCN is better than, or at least as good as, the global minima of the network consisting of its first $l$ layers}. In other words, by increasing the network depth, MCN can autonomously improve the quality of its local minima. What is more, \emph{it is easy to plug MCN into an existing deep model to make it also have this property}. Finally, under mild conditions, we show that MCN can approximate certain continuous functions arbitrarily well with \emph{high efficiency}; that is, the covering number of MCN is much smaller than that of most existing DNNs such as deep ReLU networks. Based on this, we further provide a tight generalization bound to guarantee the inference ability of MCN when dealing with testing samples.

[53]  arXiv:2007.04637 (cross-list from cs.LG) [pdf, other]
Title: IALE: Imitating Active Learner Ensembles
Comments: 14 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Active learning (AL) prioritizes the labeling of the most informative data samples. As the performance of well-known AL heuristics highly depends on the underlying model and data, recent heuristic-independent approaches that are based on reinforcement learning directly learn a policy that makes use of the labeling history to select the next sample. However, those methods typically need a huge number of samples to sufficiently explore the relevant state space. Imitation learning approaches aim to help out but again rely on a given heuristic.
This paper proposes an improved imitation learning scheme that learns a policy for batch-mode pool-based AL. This is similar to previously presented multi-armed bandit approaches, but in contrast to them, we directly train a policy that imitates the selection of the best expert heuristic at each stage of the AL cycle. We use DAGGER to train the policy on a dataset and later apply it to similar datasets. With multiple AL heuristics as experts, the policy is able to reflect the choices of the best AL heuristics given the current state of the active learning process. We evaluate our method on well-known image datasets and show that we outperform state-of-the-art imitation learners and heuristics.

[54]  arXiv:2007.04640 (cross-list from cs.LG) [pdf, other]
Title: A Policy Gradient Method for Task-Agnostic Exploration
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the entropy of the state distribution induced by limited-horizon trajectories is a sensible target. In particular, we present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy. In contrast to known methods, MEPOL is completely model-free, as it requires neither estimating the state distribution of any policy nor modeling the transition dynamics. We then empirically show that MEPOL allows learning a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and how this policy facilitates learning a variety of meaningful reward-based tasks downstream.

[55]  arXiv:2007.04641 (cross-list from cs.LG) [pdf, other]
Title: Probabilistic Value Selection for Space Efficient Model
Comments: Accepted in the 21st IEEE International Conference on Mobile Data Management (July 2020)
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)

An alternative to current mainstream preprocessing methods is proposed: Value Selection (VS). Unlike existing methods such as feature selection, which removes features, and instance selection, which eliminates instances, value selection eliminates values (with respect to each feature) in the dataset with two purposes: reducing the model size and preserving its accuracy. Two probabilistic methods based on an information-theoretic metric are proposed: PVS and P + VS. Extensive experiments on benchmark datasets of various sizes are reported. The results are compared with existing preprocessing methods such as feature selection, feature transformation, and instance selection. The experimental results show that value selection can achieve a balance between accuracy and model size reduction.

[56]  arXiv:2007.04649 (cross-list from cs.LG) [pdf, other]
Title: Learning to Teach with Deep Interactions
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Machine teaching uses a meta/teacher model to guide the training of a student model (which will be used in real tasks) through training data selection, loss function design, etc. In previous work, the teacher model takes only shallow/surface information as input (e.g., training iteration number, loss and accuracy from training/validation sets) while ignoring the internal states of the student model, which limits the potential of learning to teach. In this work, we propose an improved data teaching algorithm, where the teacher model deeply interacts with the student model by accessing its internal states. The teacher model is jointly trained with the student model using meta gradients propagated from a validation set. We conduct experiments on image classification with clean/noisy labels and empirically demonstrate that our algorithm achieves significant improvements over previous data teaching methods.

[57]  arXiv:2007.04662 (cross-list from cs.LG) [pdf, other]
Title: Untapped Potential of Data Augmentation: A Domain Generalization Viewpoint
Comments: 6 pages, ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Data augmentation is a popular pre-processing trick to improve generalization accuracy. It is believed that by processing augmented inputs in tandem with the original ones, the model learns a more robust set of features which are shared between the original and augmented counterparts. However, we show that this is not the case even for the best augmentation technique. In this work, we take a Domain Generalization viewpoint of augmentation-based methods. This new perspective allows for probing overfitting and delineating avenues for improvement. Our exploration with the state-of-the-art augmentation method provides evidence that the learned representations are not robust even to distortions used during training. This provides evidence for the untapped potential of augmented examples.

[58]  arXiv:2007.04674 (cross-list from cs.LG) [pdf, other]
Title: Resource Aware Multifidelity Active Learning for Efficient Optimization
Comments: 21 pages
Subjects: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)

Traditional methods for black box optimization require a considerable number of evaluations, which can be time consuming, impractical, and often infeasible for many engineering applications that rely on accurate representations and expensive models to evaluate. Bayesian Optimization (BO) methods search for the global optimum by progressively (actively) learning a surrogate model of the objective function along the search path. Bayesian optimization can be accelerated through multifidelity approaches which leverage multiple black-box approximations of the objective functions that can be computationally cheaper to evaluate, but still provide relevant information to the search task. Further computational benefits are offered by the availability of parallel and distributed computing architectures whose optimal usage is an open opportunity within the context of active learning. This paper introduces the Resource Aware Active Learning (RAAL) strategy, a multifidelity Bayesian scheme to accelerate the optimization of black box functions. At each optimization step, the RAAL procedure computes the set of best sample locations and the associated fidelity sources that maximize the information gain to acquire during the parallel/distributed evaluation of the objective function, while accounting for the limited computational budget. The scheme is demonstrated for a variety of benchmark problems and results are discussed for both single fidelity and multifidelity settings. In particular we observe that the RAAL strategy optimally seeds multiple points at each iteration, allowing for a major speed up of the optimization task.

[59]  arXiv:2007.04676 (cross-list from cs.LG) [pdf, ps, other]
Title: Training Restricted Boltzmann Machines with Binary Synapses using the Bayesian Learning Rule
Authors: Xiangming Meng
Comments: Technical note. Work in progress
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)

Restricted Boltzmann machines (RBMs) with low-precision synapses are appealing because of their high energy efficiency. However, training RBMs with binary synapses is challenging due to the discrete nature of the synapses. Recently, Huang proposed an efficient method to train RBMs with binary synapses by combining gradient ascent and the message passing algorithm under the variational inference framework. However, an additional heuristic clipping operation is needed. In this technical note, inspired by Huang's work, we propose an alternative optimization method using the Bayesian learning rule, which is a natural gradient variational inference method. As opposed to Huang's method, we update the natural parameters of the variational symmetric Bernoulli distribution rather than the expectation parameters. Since the natural parameters take values in the entire real domain, no additional clipping is needed. Interestingly, Huang's algorithm could be viewed as a first-order approximation of the proposed algorithm, which justifies its efficacy with heuristic clipping.

[60]  arXiv:2007.04697 (cross-list from cs.DB) [pdf]
Title: Open Data Quality Evaluation: A Comparative Analysis of Open Data in Latvia
Comments: 24 pages, 2 tables, 3 figures, Baltic J. Modern Computing
Journal-ref: Baltic J. Modern Computing, Vol. 6(2018), No. 4, 363-386
Subjects: Databases (cs.DB); Computers and Society (cs.CY); Information Retrieval (cs.IR); Applications (stat.AP); Computation (stat.CO)

Nowadays open data is entering the mainstream: it is freely available to every stakeholder and is often used in business decision-making. It is important to be sure that data is trustworthy and error-free, as quality problems can lead to huge losses. The research discusses how (open) data quality could be assessed. It also covers the main points which should be considered when developing a data quality management solution. One specific approach is applied to several Latvian open data sets. The research provides a step-by-step guide to the analysis of open data sets and summarizes its results. It is also shown that data quality can differ depending on the data supplier (centralized and decentralized data releases) and that, unfortunately, a trustworthy data supplier cannot guarantee the absence of data quality problems. Common data quality problems are also highlighted, detected not only in Latvian open data but also in the open data of 3 European countries.

[61]  arXiv:2007.04713 (cross-list from econ.EM) [pdf, ps, other]
Title: Structural Gaussian mixture vector autoregressive model
Authors: Savi Virolainen
Subjects: Econometrics (econ.EM); Methodology (stat.ME)

A structural version of the Gaussian mixture vector autoregressive model is introduced. The shocks are identified by combining simultaneous diagonalization of the error term covariance matrices with zero and sign constraints. It turns out that this often leads to less restrictive identification conditions than in conventional SVAR models, while some of the constraints are also testable. The accompanying R-package gmvarkit provides easy-to-use tools for estimating the models and applying the introduced methods.

[62]  arXiv:2007.04725 (cross-list from cs.LG) [pdf, other]
Title: EVO-RL: Evolutionary-Driven Reinforcement Learning
Comments: 9 pages, 7 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

In this work, we propose a novel approach for reinforcement learning driven by evolutionary computation. Our algorithm, dubbed Evolutionary-Driven Reinforcement Learning (evo-RL), embeds the reinforcement learning algorithm in an evolutionary cycle, where we distinctly differentiate between purely evolvable (instinctive) behaviour and purely learnable behaviour. Furthermore, we propose that this distinction is decided by the evolutionary process, thus allowing evo-RL to be adaptive to different environments. In addition, evo-RL facilitates learning on environments with rewardless states, which makes it more suited for real-world problems with incomplete information. To show that evo-RL leads to state-of-the-art performance, we present the performance of different state-of-the-art reinforcement learning algorithms when operating within evo-RL and compare it with the case when these same algorithms are executed independently. Results show that reinforcement learning algorithms embedded within our evo-RL approach significantly outperform the stand-alone versions of the same RL algorithms on OpenAI Gym control problems with rewardless states constrained by the same computational budget.

[63]  arXiv:2007.04728 (cross-list from cs.LG) [pdf, other]
Title: Let the Data Choose its Features: Differentiable Unsupervised Feature Selection
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Scientific observations often consist of a large number of variables (features). Identifying a subset of meaningful features is often ignored in unsupervised learning, despite its potential for unraveling clear patterns hidden in the ambient space. In this paper, we present a method for unsupervised feature selection, tailored for the task of clustering. We propose a differentiable loss function which combines the graph Laplacian with a gating mechanism based on continuous approximation of Bernoulli random variables. The Laplacian is used to define a scoring term that favors low-frequency features, while the parameters of the Bernoulli variables are trained to enable selection of the most informative features. We mathematically motivate the proposed approach and demonstrate that in the high noise regime, it is crucial to compute the Laplacian on the gated inputs, rather than on the full feature set. Experimental demonstration of the efficacy of the proposed approach and its advantage over current baselines is provided using several real-world examples.

[64]  arXiv:2007.04731 (cross-list from cs.LG) [pdf, other]
Title: Fast Variational Learning in State-Space Gaussian Process Models
Comments: To appear in MLSP 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Gaussian process (GP) regression with 1D inputs can often be performed in linear time via a stochastic differential equation formulation. However, for non-Gaussian likelihoods, this requires application of approximate inference methods which can make the implementation difficult, e.g., expectation propagation can be numerically unstable and variational inference can be computationally inefficient. In this paper, we propose a new method that removes such difficulties. Building upon an existing method called conjugate-computation variational inference, our approach enables linear-time inference via Kalman recursions while avoiding numerical instabilities and convergence issues. We provide an efficient JAX implementation which exploits just-in-time compilation and allows for fast automatic differentiation through large for-loops. Overall, our approach leads to fast and stable variational inference in state-space GP models that can be scaled to time series with millions of data points.

[65]  arXiv:2007.04743 (cross-list from q-bio.PE) [pdf, other]
Title: Racial Impact on Infections and Deaths due to COVID-19 in New York City
Comments: 6 pages, 7 figures, 1 table
Subjects: Populations and Evolution (q-bio.PE); Physics and Society (physics.soc-ph); Applications (stat.AP)

Redlining is the discriminatory practice whereby institutions avoided investment in certain neighborhoods due to their demographics. Here we explore the lasting impacts of redlining on the spread of COVID-19 in New York City (NYC). Using data available through the Home Mortgage Disclosure Act, we construct a redlining index for each NYC census tract via a multi-level logistic model. We compare this redlining index with the COVID-19 statistics for each NYC Zip Code Tabulation Area. Accurate mappings of the pandemic would aid the identification of the most vulnerable areas and permit the most effective allocation of medical resources, while reducing ethnic health disparities.

[66]  arXiv:2007.04750 (cross-list from cs.LG) [pdf, other]
Title: Recurrent Neural-Linear Posterior Sampling for Non-Stationary Contextual Bandits
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

An agent in a non-stationary contextual bandit problem should balance between exploration and the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive alternative to transform a non-stationary problem into a stationary problem that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. In order to address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach relies on a combination of features extracted by recurrent neural networks with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and non-contextual non-stationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional non-stationary bandit algorithms.

[67]  arXiv:2007.04758 (cross-list from q-fin.RM) [pdf, ps, other]
Title: A Bivariate Compound Dynamic Contagion Process for Cyber Insurance
Authors: Jiwook Jang, Rosy Oh
Subjects: Risk Management (q-fin.RM); Other Statistics (stat.OT)

As corporates and governments become more digital, they become vulnerable to various forms of cyber attack. Cyber insurance products have been used as risk management tools, yet their pricing does not reflect actual risk, including that of multiple, catastrophic and contagious losses. For the modelling of aggregate losses from cyber events, in this paper we introduce a bivariate compound dynamic contagion process, where the bivariate dynamic contagion process is a point process that includes both externally excited joint jumps, which are distributed according to a shot noise Cox process and two separate self-excited jumps, which are distributed according to the branching structure of a Hawkes process with an exponential fertility rate, respectively. We analyse the theoretical distributional properties for these processes systematically, based on the piecewise deterministic Markov process developed by Davis (1984) and the univariate dynamic contagion process theory developed by Dassios and Zhao (2011). The analytic expression of the Laplace transform of the compound process and its moments are presented, which have the potential to be applicable to a variety of problems in credit, insurance, market and other operational risks. As an application of this process, we provide insurance premium calculations based on its moments. Numerical examples show that this compound process can be used for the modelling of aggregate losses from cyber events. We also provide the simulation algorithm for statistical analysis, further business applications and research.

[68]  arXiv:2007.04759 (cross-list from cs.LG) [pdf, other]
Title: Expressivity of Deep Neural Networks
Comments: This review paper will appear as a book chapter in the book "Theory of Deep Learning" by Cambridge University Press
Subjects: Machine Learning (cs.LG); Functional Analysis (math.FA); Machine Learning (stat.ML)

In this review paper, we give a comprehensive overview of the large variety of approximation results for neural networks. Approximation rates for classical function spaces as well as benefits of deep neural networks over shallow ones for specifically structured function classes are discussed. While the main body of existing results is for general feedforward architectures, we also depict approximation results for convolutional, residual and recurrent neural networks.

[69]  arXiv:2007.04777 (cross-list from eess.IV) [pdf, other]
Title: Self-supervised edge features for improved Graph Neural Network training
Comments: Comments welcome. arXiv admin note: substantial text overlap with arXiv:2006.12971
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)

Graph Neural Networks (GNN) have been extensively used to extract meaningful representations from graph structured data and to perform predictive tasks such as node classification and link prediction. In recent years, there has been a lot of work incorporating edge features along with node features for prediction tasks. One of the main difficulties in using edge features is that they are often handcrafted, hard to get, specific to a particular domain, and may contain redundant information. In this work, we present a framework for creating new edge features, applicable to any domain, via a combination of self-supervised and unsupervised learning. In addition to this, we use Forman-Ricci curvature as an additional edge feature to encapsulate the local geometry of the graph. We then encode our edge features via a Set Transformer and combine them with node features extracted from popular GNN architectures for node classification in an end-to-end training scheme. We validate our work on three biological datasets comprising single-cell RNA sequencing data of neurological disease, \textit{in vitro} SARS-CoV-2 infection, and human COVID-19 patients. We demonstrate that our method achieves better performance on node classification tasks than baseline Graph Attention Network (GAT) and Graph Convolutional Network (GCN) models. Furthermore, given the attention mechanism on edge and node features, we are able to interpret the cell types and genes that determine the course and severity of COVID-19, contributing to a growing list of potential disease biomarkers and therapeutic targets.

[70]  arXiv:2007.04785 (cross-list from cs.LG) [pdf, other]
Title: Neural Architecture Search with GBDT
Comments: Code is available at this https URL
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Neural architecture search (NAS) with an accuracy predictor that predicts the accuracy of candidate architectures has drawn increasing interest due to its simplicity and effectiveness. Previous works employ neural network based predictors, which unfortunately cannot exploit the tabular data representations of network architectures well. As decision tree-based models can better handle tabular data, in this paper, we propose to leverage gradient boosting decision tree (GBDT) as the predictor for NAS and demonstrate that it can improve the prediction accuracy and help to find better architectures than neural network based predictors. Moreover, considering that a better and more compact search space can ease the search process, we propose to prune the search space gradually according to important features derived from GBDT using an interpreting tool named SHAP. In this way, NAS can be performed by first pruning the search space (using GBDT as a pruner) and then searching a neural architecture (using GBDT as a predictor), which is more efficient and effective. Experiments on NASBench-101 and ImageNet demonstrate the effectiveness of GBDT for NAS: (1) NAS with a GBDT predictor finds a top-10 architecture (among all the architectures in the search space) with $0.18\%$ test regret on NASBench-101, and achieves a $24.2\%$ top-1 error rate on ImageNet; and (2) GBDT based search space pruning and neural architecture search further achieves a $23.5\%$ top-1 error rate on ImageNet.

[71]  arXiv:2007.04790 (cross-list from cs.LG) [pdf, other]
Title: MO-PaDGAN: Generating Diverse Designs with Multivariate Performance Enhancement
Authors: Wei Chen, Faez Ahmed
Comments: arXiv admin note: text overlap with arXiv:2002.11304
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Deep generative models have proven useful for automatic design synthesis and design space exploration. However, they face three challenges when applied to engineering design: 1) generated designs lack diversity, 2) it is difficult to explicitly improve all the performance measures of generated designs, and 3) existing models generally do not generate high-performance novel designs, outside the domain of the training data. To address these challenges, we propose MO-PaDGAN, which contains a new Determinantal Point Processes based loss function for probabilistic modeling of diversity and performances. Through a real-world airfoil design example, we demonstrate that MO-PaDGAN expands the existing boundary of the design space towards high-performance regions and generates new designs with high diversity and performances exceeding training data.

[72]  arXiv:2007.04793 (cross-list from cs.CV) [pdf, other]
Title: Statistical shape analysis of brain arterial networks (BAN)
Comments: arXiv admin note: substantial text overlap with arXiv:2003.00287
Subjects: Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG); Applications (stat.AP)

Structures of brain arterial networks (BANs), which are complex arrangements of individual arteries, their branching patterns, and inter-connectivities, play an important role in characterizing and understanding brain physiology. One would like tools for statistically analyzing the shapes of BANs, i.e., quantifying shape differences, comparing populations of subjects, and studying the effects of covariates on these shapes. This paper mathematically represents and statistically analyzes BAN shapes as elastic shape graphs. Each elastic shape graph is made up of nodes that are connected by a number of 3D curves, and edges, with arbitrary shapes. We develop a mathematical representation, a Riemannian metric and other geometrical tools, such as computations of geodesics, means and covariances, and PCA for analyzing elastic graphs and BANs. This analysis is applied to BANs after separating them into four components: top, bottom, left, and right. This framework is then used to generate shape summaries of BANs from 92 subjects, and to study the effects of age and gender on shapes of BAN components. We conclude that while gender effects require further investigation, age has a clear, quantifiable effect on BAN shapes. Specifically, we find an increased variance in BAN shapes as age increases.

[73]  arXiv:2007.04800 (cross-list from cs.LG) [pdf, other]
Title: When Humans and Machines Make Joint Decisions: A Non-Symmetric Bandit Model
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

How can humans and machines learn to make joint decisions? This has become an important question in domains such as medicine, law and finance. We approach the question from a theoretical perspective and formalize our intuitions about human-machine decision making in a non-symmetric bandit model. In doing so, we follow the example of a doctor who is assisted by a computer program. We show that in our model, exploration is generally hard. In particular, unless one is willing to make assumptions about how human and machine interact, the machine cannot explore efficiently. We highlight one such assumption, policy space independence, which resolves the coordination problem and allows both players to explore independently. Our results shed light on the fundamental difficulties faced by the interaction of humans and machines. We also discuss practical implications for the design of algorithmic decision systems.

[74]  arXiv:2007.04806 (cross-list from cs.LG) [pdf, other]
Title: Client Adaptation improves Federated Learning with Simulated Non-IID Clients
Comments: 11 pages, 11 figures. To appear at International Workshop on Federated Learning for User Privacy and Data Confidentiality in Conjunction with ICML 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

We present a federated learning approach for learning a client-adaptable, robust model when data is non-identically and non-independently distributed (non-IID) across clients. By simulating heterogeneous clients, we show that adding learned client-specific conditioning improves model performance, and the approach is shown to work on balanced and imbalanced data sets from both audio and image domains. The client adaptation is implemented by a conditional gated activation unit and is particularly beneficial when there are large differences between the clients' data distributions, a common scenario in federated learning.
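
The abstract does not spell out the conditional gated activation unit. As a rough illustration only, the sketch below assumes the widely used form tanh(W_f x + V_f c) * sigmoid(W_g x + V_g c), where x is the layer input and c is a client-specific conditioning vector. The function and parameter names (conditional_gated_activation, w_f, v_f, w_g, v_g) are hypothetical and may differ from what the paper actually uses.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def conditional_gated_activation(x, c, w_f, v_f, w_g, v_g):
    """Assumed form: tanh(W_f x + V_f c) * sigmoid(W_g x + V_g c).

    x is the layer input; c is a client-specific conditioning vector.
    All weight matrices are hypothetical illustration parameters.
    """
    filter_pre = [a + b for a, b in zip(matvec(w_f, x), matvec(v_f, c))]
    gate_pre = [a + b for a, b in zip(matvec(w_g, x), matvec(v_g, c))]
    filt = [math.tanh(p) for p in filter_pre]
    gate = [1.0 / (1.0 + math.exp(-p)) for p in gate_pre]
    # Element-wise product: the gate decides how much of each filter output passes.
    return [f * g for f, g in zip(filt, gate)]
```

In this form, the conditioning vector shifts both the filter and the gate, so two clients feeding the same input x can obtain different activations.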

[75]  arXiv:2007.04825 (cross-list from cs.LG) [pdf, other]
Title: Fast Transformers with Clustered Attention
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Transformers have been proven a successful model for a variety of tasks in sequence modeling. However, computing the attention matrix, which is their key component, has quadratic complexity with respect to the sequence length, thus making them prohibitively expensive for large sequences. To address this, we propose clustered attention, which instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids. To further improve this approximation, we use the computed clusters to identify the keys with the highest attention per query and compute the exact key/query dot products. This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters. We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget. Finally, we demonstrate that our model can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pretrained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance.
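
The first step described above (attention computed once per cluster centroid and shared by every query in that cluster) can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes cluster assignments are already given, uses standard scaled dot-product attention, and omits the refinement step that recomputes exact dot products for the top keys. The names attend and clustered_attention are hypothetical.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def clustered_attention(queries, keys, values, assignment, n_clusters):
    """Compute attention once per cluster centroid and reuse it for members.

    assignment[i] is the (assumed precomputed) cluster index of queries[i].
    """
    dim = len(queries[0])
    sums = [[0.0] * dim for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for query, cluster in zip(queries, assignment):
        counts[cluster] += 1
        for j in range(dim):
            sums[cluster][j] += query[j]
    centroid_output = {}
    for cluster in range(n_clusters):
        if counts[cluster] == 0:
            continue  # skip empty clusters
        centroid = [s / counts[cluster] for s in sums[cluster]]
        centroid_output[cluster] = attend(centroid, keys, values)
    # Every query receives the output computed for its cluster's centroid.
    return [centroid_output[cluster] for cluster in assignment]
```

With a fixed number of clusters C, this computes C attention rows instead of one per query, which is where the linear dependence on sequence length comes from.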

[76]  arXiv:2007.04838 (cross-list from cs.LG) [pdf, other]
Title: Improving the Robustness of Trading Strategy Backtesting with Boltzmann Machines and Generative Adversarial Networks
Comments: 72 pages, 30 figures
Subjects: Machine Learning (cs.LG); Portfolio Management (q-fin.PM); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)

This article explores the use of machine learning models to build a market generator. The underlying idea is to simulate artificial multi-dimensional financial time series, whose statistical properties are the same as those observed in the financial markets. In particular, these synthetic data must preserve the probability distribution of asset returns, the stochastic dependence between the different assets and the autocorrelation across time. The article proposes then a new approach for estimating the probability distribution of backtest statistics. The final objective is to develop a framework for improving the risk management of quantitative investment strategies, in particular in the space of smart beta, factor investing and alternative risk premia.

[77]  arXiv:2007.04849 (cross-list from quant-ph) [pdf, ps, other]
Title: Physics-inspired forms of the Bayesian Cramér-Rao bound
Authors: Mankei Tsang
Comments: 4 pages
Subjects: Quantum Physics (quant-ph); Statistics Theory (math.ST)

Using the language of differential geometry, I derive a form of the Bayesian Cramér-Rao bound that remains invariant under reparametrization. By assuming that the prior probability density is the square of a wavefunction, I also express the bound in terms of functionals that are quadratic with respect to the wavefunction and its gradient. The problem of finding an unfavorable prior to tighten the bound for minimax estimation is shown, in a special case, to be equivalent to finding the ground-state energy with the Schrödinger equation, with the Fisher information playing the role of the potential.

[78]  arXiv:2007.04871 (cross-list from cs.LG) [pdf, other]
Title: Subject-Aware Contrastive Learning for Biosignals
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)

Datasets for biosignals, such as electroencephalogram (EEG) and electrocardiogram (ECG), often have noisy labels and a limited number of subjects (<100). To handle these challenges, we propose a self-supervised approach based on contrastive learning to model biosignals with a reduced reliance on labeled data and with fewer subjects. In this regime of limited labels and subjects, intersubject variability negatively impacts model performance. Thus, we introduce subject-aware learning through (1) a subject-specific contrastive loss, and (2) adversarial training to promote subject-invariance during the self-supervised learning. We also develop a number of time-series data augmentation techniques to be used with the contrastive loss for biosignals. Our method is evaluated on publicly available datasets of two different biosignals with different tasks: EEG decoding and ECG anomaly detection. The embeddings learned using self-supervision yield competitive classification results compared to entirely supervised methods. We show that subject-invariance improves representation quality for these tasks, and observe that subject-specific loss increases performance when fine-tuning with supervised labels.

[79]  arXiv:2007.04873 (cross-list from cs.LG) [pdf, other]
Title: Invertible Zero-Shot Recognition Flows
Comments: ECCV2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Deep generative models have been successfully applied to Zero-Shot Learning (ZSL) recently. However, the underlying drawbacks of GANs and VAEs (e.g., the hardness of training with ZSL-oriented regularizers and the limited generation quality) hinder the existing generative ZSL models from fully bypassing the seen-unseen bias. To tackle the above limitations, for the first time, this work incorporates a new family of generative models (i.e., flow-based models) into ZSL. The proposed Invertible Zero-shot Flow (IZF) learns factorized data embeddings (i.e., the semantic factors and the non-semantic ones) with the forward pass of an invertible flow network, while the reverse pass generates data samples. This procedure theoretically extends conventional generative flows to a factorized conditional scheme. To explicitly solve the bias problem, our model enlarges the seen-unseen distributional discrepancy based on negative sample-based distance measurement. Notably, IZF works flexibly with either a naive Bayesian classifier or a held-out trainable one for zero-shot recognition. Experiments on widely-adopted ZSL benchmarks demonstrate the significant performance gain of IZF over existing methods, in both classic and generalized settings.

[80]  arXiv:2007.04876 (cross-list from cs.LG) [pdf, ps, other]
Title: Multinomial Logit Bandit with Low Switching Cost
Comments: Accepted for presentation at the International Conference on Machine Learning (ICML) 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We study multinomial logit bandit with limited adaptivity, where the algorithms change their exploration actions as infrequently as possible when achieving almost optimal minimax regret. We propose two measures of adaptivity: the assortment switching cost and the more fine-grained item switching cost. We present an anytime algorithm (AT-DUCB) with $O(N \log T)$ assortment switches, almost matching the lower bound $\Omega(\frac{N \log T}{ \log \log T})$. In the fixed-horizon setting, our algorithm FH-DUCB incurs $O(N \log \log T)$ assortment switches, matching the asymptotic lower bound. We also present the ESUCB algorithm with item switching cost $O(N \log^2 T)$.

[81]  arXiv:2007.04897 (cross-list from q-bio.QM) [pdf, other]
Title: Guiding Deep Molecular Optimization with Genetic Exploration
Subjects: Quantitative Methods (q-bio.QM); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

De novo molecular design attempts to search over the chemical space for molecules with the desired property. Recently, deep learning has gained considerable attention as a promising approach to solve the problem. In this paper, we propose genetic expert-guided learning (GEGL), a simple yet novel framework for training a deep neural network (DNN) to generate highly-rewarding molecules. Our main idea is to design a "genetic expert improvement" procedure, which generates high-quality targets for imitation learning of the DNN. Extensive experiments show that GEGL significantly improves over state-of-the-art methods. For example, GEGL manages to solve the penalized octanol-water partition coefficient optimization with a score of 31.82, while the best-known score in the literature is 26.1. Besides, for the GuacaMol benchmark with 20 tasks, our method achieves the highest score for 19 tasks, in comparison with state-of-the-art methods, and newly obtains the perfect score for three tasks.

[82]  arXiv:2007.04911 (cross-list from cs.LG) [pdf, other]
Title: GAMA: a General Automated Machine learning Assistant
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

The General Automated Machine learning Assistant (GAMA) is a modular AutoML system developed to empower users to track and control how AutoML algorithms search for optimal machine learning pipelines, and facilitate AutoML research itself. In contrast to current, often black-box systems, GAMA allows users to plug in different AutoML and post-processing techniques, logs and visualizes the search process, and supports easy benchmarking. It currently features three AutoML search algorithms, two model post-processing steps, and is designed to allow for more components to be added.

[83]  arXiv:2007.04915 (cross-list from cs.LG) [pdf, other]
Title: Influence Diagram Bandits: Variational Thompson Sampling for Structured Bandit Problems
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We propose a novel framework for structured bandits, which we call an influence diagram bandit. Our framework captures complex statistical dependencies between actions, latent variables, and observations; and thus unifies and extends many existing models, such as combinatorial semi-bandits, cascading bandits, and low-rank bandits. We develop novel online learning algorithms that learn to act efficiently in our models. The key idea is to track a structured posterior distribution of model parameters, either exactly or approximately. To act, we sample model parameters from their posterior and then use the structure of the influence diagram to find the most optimistic action under the sampled parameters. We empirically evaluate our algorithms in three structured bandit problems, and show that they perform as well as or better than problem-specific state-of-the-art baselines.
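
The sample-then-act loop described above is the core of Thompson sampling. The sketch below shows it in the simplest setting, an unstructured Bernoulli bandit with independent Beta posteriors. It is not the influence diagram algorithm from the paper, which tracks a structured posterior. The names thompson_step and run_bandit, and the uniform Beta(1, 1) prior, are assumptions made for illustration.

```python
import random


def thompson_step(successes, failures, rng):
    """Sample one mean per arm from its Beta posterior and pick the best arm."""
    samples = [
        rng.betavariate(s + 1, f + 1)  # Beta(1, 1) prior, assumed for illustration
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_means, n_rounds, seed=0):
    """Simulate Thompson sampling on a Bernoulli bandit with the given arm means."""
    rng = random.Random(seed)
    successes = [0] * len(true_means)
    failures = [0] * len(true_means)
    for _ in range(n_rounds):
        arm = thompson_step(successes, failures, rng)
        # Observe a Bernoulli reward and update that arm's posterior counts.
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```

Calling run_bandit([0.1, 0.9], 2000) should concentrate most pulls on the second arm, because posterior samples for the weaker arm rarely exceed those of the stronger one once a few rewards have been observed.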

[84]  arXiv:2007.04921 (cross-list from q-bio.QM) [pdf, other]
Title: Graph Neural Network Based Coarse-Grained Mapping Prediction
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)

The selection of coarse-grained (CG) mapping operators is a critical step for CG molecular dynamics (MD) simulation. What is optimal for this choice remains an open question, and there is a need for theory. The current state-of-the-art method is manual selection of mapping operators by experts. In this work, we demonstrate an automated approach by viewing this problem as supervised learning where we seek to reproduce the mapping operators produced by experts. We present a graph neural network based CG mapping predictor called Deep Supervised Graph Partitioning Model (DSGPM) that treats mapping operators as a graph segmentation problem. DSGPM is trained on a novel dataset, Human-annotated Mappings (HAM), consisting of 1,206 molecules with expert-annotated mapping operators. HAM can be used to facilitate further research in this area. Our model uses a novel metric learning objective to produce high-quality atomic features that are used in spectral clustering. The results show that the DSGPM outperforms state-of-the-art methods in the field of graph segmentation.

[85]  arXiv:2007.04929 (cross-list from cs.LG) [pdf, other]
Title: Learning Graph Structure With A Finite-State Automaton Layer
Comments: Submitted to NeurIPS 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Graph-based neural network models are producing strong results in a number of domains, in part because graphs provide flexibility to encode domain knowledge in the form of relational structure (edges) between nodes in the graph. In practice, edges are used both to represent intrinsic structure (e.g., abstract syntax trees of programs) and more abstract relations that aid reasoning for a downstream task (e.g., results of relevant program analyses). In this work, we study the problem of learning to derive abstract relations from the intrinsic graph structure. Motivated by their power in program analyses, we consider relations defined by paths on the base graph accepted by a finite-state automaton. We show how to learn these relations end-to-end by relaxing the problem into learning finite-state automata policies on a graph-based POMDP and then training these policies using implicit differentiation. The result is a differentiable Graph Finite-State Automaton (GFSA) layer that adds a new edge type (expressed as a weighted adjacency matrix) to a base graph. We demonstrate that this layer can find shortcuts in grid-world graphs and reproduce simple static analyses on Python programs. Additionally, we combine the GFSA layer with a larger graph-based model trained end-to-end on the variable misuse program understanding task, and find that using the GFSA layer leads to better performance than using hand-engineered semantic edges or other baseline methods for adding learned edge types.

[86]  arXiv:2007.04938 (cross-list from cs.LG) [pdf, other]
Title: SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Model-free deep reinforcement learning (RL) has been successful in a range of challenging domains. However, there are some remaining issues, such as stabilizing the optimization of nonlinear function approximators, preventing error propagation due to the Bellman backup in Q-learning, and efficient exploration. To mitigate these issues, we present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy RL algorithms. SUNRISE integrates three key ingredients: (a) bootstrap with random initialization, which improves the stability of the learning process by training a diverse ensemble of agents, (b) weighted Bellman backups, which prevent error propagation in Q-learning by reweighting sample transitions based on uncertainty estimates from the ensembles, and (c) an inference method that selects actions using highest upper-confidence bounds for efficient exploration. Our experiments show that SUNRISE significantly improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments. Our training code is available at https://github.com/pokaxpoka/sunrise.
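
Ingredient (c) can be illustrated with a small sketch for a discrete action set. It assumes the upper-confidence bound is the ensemble mean of the Q-values plus a multiple of their standard deviation. The function name ucb_action and the parameter lam are hypothetical, and the released SUNRISE code may compute the bound differently.

```python
from statistics import mean, pstdev


def ucb_action(q_ensemble, lam=1.0):
    """Pick the action maximizing mean(Q) + lam * std(Q) across ensemble members.

    q_ensemble[i][a] is the Q-value that ensemble member i assigns to action a
    in the current state. lam trades off exploitation against exploration.
    """
    n_actions = len(q_ensemble[0])
    scores = []
    for action in range(n_actions):
        estimates = [member[action] for member in q_ensemble]
        scores.append(mean(estimates) + lam * pstdev(estimates))
    return max(range(n_actions), key=scores.__getitem__)
```

For example, with q_ensemble = [[1.0, 0.0], [1.0, 1.6]], setting lam=0.0 picks the action with the higher mean (action 0), while lam=1.0 favors action 1, on which the ensemble members disagree.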

[87]  arXiv:2007.04965 (cross-list from cs.LG) [pdf, other]
Title: A Study on Encodings for Neural Architecture Search
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

Neural architecture search (NAS) has been extensively studied in the past few years. A popular approach is to represent each neural architecture in the search space as a directed acyclic graph (DAG), and then search over all DAGs by encoding the adjacency matrix and list of operations as a set of hyperparameters. Recent work has demonstrated that even small changes to the way each architecture is encoded can have a significant effect on the performance of NAS algorithms.
In this work, we present the first formal study on the effect of architecture encodings for NAS, including a theoretical grounding and an empirical study. First we formally define architecture encodings and give a theoretical characterization of the scalability of the encodings we study. Then we identify the main encoding-dependent subroutines which NAS algorithms employ, running experiments to show which encodings work best with each subroutine for many popular algorithms. The experiments act as an ablation study for prior work, disentangling the algorithmic and encoding-based contributions, as well as a guideline for future work. Our results demonstrate that NAS encodings are an important design decision which can have a significant impact on overall performance. Our code is available at https://github.com/naszilla/nas-encodings.

[88]  arXiv:2007.04972 (cross-list from cs.LG) [pdf, other]
Title: Prostate motion modelling using biomechanically-trained deep neural networks on unstructured nodes
Comments: Accepted to MICCAI 2020
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)

In this paper, we propose to train deep neural networks with biomechanical simulations, to predict the prostate motion encountered during ultrasound-guided interventions. In this application, unstructured points are sampled from segmented pre-operative MR images to represent the anatomical regions of interest. The point sets are then assigned with point-specific material properties and displacement loads, forming the un-ordered input feature vectors. An adapted PointNet can be trained to predict the nodal displacements, using finite element (FE) simulations as ground-truth data. Furthermore, a versatile bootstrap aggregating mechanism is validated to accommodate the variable number of feature vectors due to different patient geometries, comprised of a training-time bootstrap sampling and a model averaging inference. This results in a fast and accurate approximation to the FE solutions without requiring subject-specific solid meshing. Based on 160,000 nonlinear FE simulations on clinical imaging data from 320 patients, we demonstrate that the trained networks generalise to unstructured point sets sampled directly from holdout patient segmentation, yielding a near real-time inference and an expected error of 0.017 mm in predicted nodal displacement.

[89]  arXiv:2007.04973 (cross-list from cs.LG) [pdf, other]
Title: Contrastive Code Representation Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE); Machine Learning (stat.ML)

Machine-aided programming tools such as type predictors and code summarizers are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on the raw text of programs. In particular, we design an unsupervised pretext task by generating textually divergent copies of source functions via automated source-to-source compiler transforms that preserve semantics. We train a neural model to identify variants of an anchor program within a large batch of negatives. To solve this task, the network must extract program features representing the functionality, not form, of the program. This is the first application of instance discrimination to code representation learning to our knowledge. We pre-train models over 1.8m unannotated JavaScript methods mined from GitHub. ContraCode pre-training improves code summarization accuracy by 7.9% over supervised approaches and 4.8% over RoBERTa pre-training. Moreover, our approach is agnostic to model architecture; for a type inference task, contrastive pre-training consistently improves the accuracy of existing baselines.
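
The task of identifying a variant of an anchor among negatives is commonly trained with an InfoNCE-style loss. The sketch below assumes that loss with cosine similarity and a temperature. The abstract does not name the loss, so info_nce, cosine, and the default temperature are illustrative assumptions rather than the authors' exact objective. The embedding vectors stand in for the output of a program encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive among the positive plus negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    top = max(logits)
    # log-sum-exp computed stably by subtracting the largest logit
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]
```

The loss is small when the anchor embedding is closer to its semantics-preserving variant than to any negative, which is what pushes the encoder to represent functionality rather than surface form.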

[90]  arXiv:2007.04976 (cross-list from cs.LG) [pdf, other]
Title: One Policy to Control Them All: Shared Modular Policies for Agent-Agnostic Control
Comments: Accepted at ICML 2020. Videos and code at this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Reinforcement learning is typically concerned with learning control policies tailored to a particular agent. We investigate whether there exists a single global policy that can generalize to control a wide variety of agent morphologies -- ones in which even the dimensionality of the state and action spaces changes. We propose to express this global policy as a collection of identical modular neural networks, dubbed Shared Modular Policies (SMP), that correspond to each of the agent's actuators. Every module is only responsible for controlling its corresponding actuator and receives information from only its local sensors. In addition, messages are passed between modules, propagating information between distant modules. We show that a single modular policy can successfully generate locomotion behaviors for several planar agents with different skeletal structures such as monopod hoppers, quadrupeds, bipeds, and generalize to variants not seen during training -- a process that would normally require training and manual hyperparameter tuning for each morphology. We observe that a wide variety of drastically diverse locomotion styles across morphologies, as well as centralized coordination, emerge via message passing between decentralized modules purely from the reinforcement learning objective. Videos and code at https://huangwl18.github.io/modular-rl/

Replacements for Fri, 10 Jul 20

[91]  arXiv:1510.02753 (replaced) [pdf, ps, other]
Title: Organic direct and indirect effects with post-treatment common causes of mediator and outcome
Authors: Judith J Lok
Comments: 9 pages
Subjects: Methodology (stat.ME)
[92]  arXiv:1708.02166 (replaced) [pdf, other]
Title: Nonlinear spectral analysis: A local Gaussian approach
Comments: Version 4: Major revision from version 3, with new theory/figures. 135 pages (main part 32 + appendices 103), 11 + 16 figures
Subjects: Methodology (stat.ME)
[93]  arXiv:1710.03863 (replaced) [pdf, ps, other]
Title: On Estimation of $L_{r}$-Norms in Gaussian White Noise Models
Comments: To appear in Probability Theory and Related Fields
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG)
[94]  arXiv:1711.04145 (replaced) [pdf, other]
Title: Minimax estimation in linear models with unknown design over finite alphabets
Authors: Merle Behr, Axel Munk
Subjects: Statistics Theory (math.ST)
[95]  arXiv:1803.06675 (replaced) [pdf, other]
Title: Rare Feature Selection in High Dimensions
Comments: 42 pages, 10 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Computation (stat.CO); Machine Learning (stat.ML)
[96]  arXiv:1805.08304 (replaced) [pdf, other]
Title: Anchored Bayesian Gaussian Mixture Models
Comments: 65 pages, 11 figures, 11 tables
Subjects: Methodology (stat.ME)
[97]  arXiv:1806.10120 (replaced) [pdf, other]
Title: Maximum Likelihood Estimation for Totally Positive Log-Concave Densities
Subjects: Statistics Theory (math.ST); Combinatorics (math.CO)
[98]  arXiv:1807.02161 (replaced) [pdf, ps, other]
Title: Minimizing Sensitivity to Model Misspecification
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
[99]  arXiv:1811.00724 (replaced) [pdf, other]
Title: Bayesian Hierarchical Modeling on Covariance Valued Data
Comments: Some key references are missing in the old version which are corrected in this version
Subjects: Applications (stat.AP)
[100]  arXiv:1901.03904 (replaced) [pdf]
Title: A Speech Act Classifier for Persian Texts and its Application in Identify Speech Act of Rumors
Comments: Published Link: this http URL
Journal-ref: Journal of Soft Computing and Information Technology, 9, 1, 1399 (2020), 18-27
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
[101]  arXiv:1903.12077 (replaced) [pdf, ps, other]
Title: Time series models for realized covariance matrices based on the matrix-F distribution
Subjects: Statistics Theory (math.ST); Econometrics (econ.EM); Methodology (stat.ME)
[102]  arXiv:1904.07150 (replaced) [pdf, other]
Title: Variational Bayes for high-dimensional linear regression with sparse priors
Comments: 42 pages. We have added oracle contraction rates, removed the mutual coherence assumption, significantly expanded the simulations and generally improved the presentation
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
[103]  arXiv:1905.11497 (replaced) [pdf, other]
Title: Estimating Average Treatment Effects Utilizing Fractional Imputation when Confounders are Subject to Missingness
Subjects: Methodology (stat.ME); Other Statistics (stat.OT)
[104]  arXiv:1909.00721 (replaced) [pdf, other]
Title: Greedy clustering of count data through a mixture of multinomial PCA
Authors: Nicolas Jouvin (1 and 2), Pierre Latouche (2), Charles Bouveyron (3), Guillaume Bataillon (4), Alain Livartowski (4) ((1) Laboratoire SAMM EA 4543, (2) Laboratoire MAP5 UMR 8145, (3) Laboratoire J.A. Dieudonné UMR 7351 (4) Institut Curie)
Comments: 34 pages, 11 figures, published in : Computational Statistics
Subjects: Methodology (stat.ME)
[105]  arXiv:1909.02553 (replaced) [pdf, other]
Title: Smooth Contextual Bandits: Bridging the Parametric and Non-differentiable Regret Regimes
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
[106]  arXiv:1909.02707 (replaced) [pdf, other]
Title: Restricted Minimum Error Entropy Criterion for Robust Classification
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[107]  arXiv:1909.07543 (replaced) [pdf, other]
Title: Attraction-Repulsion Actor-Critic for Continuous Control Reinforcement Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
[108]  arXiv:1910.00760 (replaced) [pdf, other]
Title: Efficient Graph Generation with Graph Recurrent Attention Networks
Comments: Neural Information Processing Systems (NeurIPS) 2019
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[109]  arXiv:1910.00780 (replaced) [pdf, other]
Title: How Does Topology of Neural Architectures Impact Gradient Propagation and Model Performance?
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[110]  arXiv:1910.01741 (replaced) [pdf, other]
Title: Improving Sample Efficiency in Model-Free Reinforcement Learning from Images
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
[111]  arXiv:1910.03103 (replaced) [pdf, other]
Title: Energy-Aware Neural Architecture Optimization with Fast Splitting Steepest Descent
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[112]  arXiv:1910.04817 (replaced) [pdf, other]
Title: Estimation of Bounds on Potential Outcomes For Decision Making
Journal-ref: ICML 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[113]  arXiv:1910.09466 (replaced) [pdf, ps, other]
Title: Sparsification as a Remedy for Staleness in Distributed Asynchronous SGD
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[114]  arXiv:1911.13211 (replaced) [pdf, other]
Title: Embedding and learning with signatures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[115]  arXiv:2002.04017 (replaced) [pdf, ps, other]
Title: Provable Self-Play Algorithms for Competitive Reinforcement Learning
Authors: Yu Bai, Chi Jin
Comments: Appearing at ICML 2020. Fixed typos from v1
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[116]  arXiv:2002.10099 (replaced) [pdf, other]
Title: Implicit Geometric Regularization for Learning Shapes
Comments: 37th International Conference on Machine Learning, Vienna, Austria, 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (stat.ML)
[117]  arXiv:2003.02570 (replaced) [pdf, other]
Title: Train by Reconnect: Decoupling Locations of Weights from their Values
Authors: Yushi Qiu, Reiji Suda
Comments: 15 pages, 15 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[118]  arXiv:2003.05926 (replaced) [pdf, other]
Title: Learning distributed representations of graphs with Geo2DR
Comments: 9 Pages, Revised version accepted at ICML 2020 GRL+ Workshop
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[119]  arXiv:2003.13461 (replaced) [pdf, other]
Title: Adaptive Personalized Federated Learning
Comments: [v2] A new generalization analysis is provided. Also, additional experiments are added
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
[120]  arXiv:2004.03658 (replaced) [pdf, other]
Title: Faithful Embeddings for Knowledge Base Queries
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
[121]  arXiv:2004.06630 (replaced) [pdf]
Title: A logic-based resampling with matching approach to multiple imputation of missing data
Subjects: Methodology (stat.ME)
[122]  arXiv:2004.13962 (replaced) [pdf, other]
Title: Energy Balancing of Covariate Distributions
Subjects: Methodology (stat.ME)
[123]  arXiv:2005.00527 (replaced) [pdf, ps, other]
Title: Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
[124]  arXiv:2005.05080 (replaced) [pdf, other]
Title: Continual Learning Using Task Conditional Neural Networks
Comments: 10 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[125]  arXiv:2005.05587 (replaced) [pdf, ps, other]
Title: Robustness Verification for Classifier Ensembles
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[126]  arXiv:2006.00701 (replaced) [pdf, ps, other]
Title: Locally Differentially Private (Contextual) Bandits Learning
Comments: 19 pages (including appendix)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[127]  arXiv:2006.11419 (replaced) [pdf, other]
Title: Set-Invariant Constrained Reinforcement Learning with a Meta-Optimizer
Comments: Accepted to ICML 2020 Workshop Theoretical Foundations of RL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[128]  arXiv:2006.11695 (replaced) [pdf, other]
Title: Learned Uncertainty-Aware (LUNA) Bases for Bayesian Regression using Multi-Headed Auxiliary Networks
Comments: ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[129]  arXiv:2006.13681 (replaced) [pdf, other]
Title: Multi-view Drone-based Geo-localization via Style and Spatial Alignment
Comments: 9 pages 9 figures. arXiv admin note: text overlap with arXiv:2002.12186 by other authors
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[130]  arXiv:2006.15061 (replaced) [pdf, other]
Title: Intrinsic Reward Driven Imitation Learning via Generative Model
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[131]  arXiv:2006.15935 (replaced) [pdf]
Title: Is Japanese gendered language used on Twitter ? A large scale study
Subjects: Computation and Language (cs.CL); Applications (stat.AP)
[132]  arXiv:2007.02153 (replaced) [pdf, other]
Title: An Empirical Bayes Approach to Shrinkage Estimation on the Manifold of Symmetric Positive-Definite Matrices
Comments: 55 pages, 5 figures, journal submission
Subjects: Statistics Theory (math.ST)
[133]  arXiv:2007.02196 (replaced) [pdf, other]
Title: Deep Active Learning via Open Set Recognition
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[134]  arXiv:2007.02523 (replaced) [pdf, other]
Title: Covariate Distribution Aware Meta-learning
Journal-ref: Published in ICML 2020 Lifelong Learning Workshop
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[135]  arXiv:2007.02725 (replaced) [pdf]
Title: The FMRIB Variational Bayesian Inference Tutorial II: Stochastic Variational Bayes
Comments: Example code and exercises associated with this tutorial can be found here: this https URL
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
[136]  arXiv:2007.03114 (replaced) [pdf, other]
Title: Relaxed Conformal Prediction Cascades for Efficient Inference Over Many Labels
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[137]  arXiv:2007.03167 (replaced) [pdf, other]
Title: Are Ensemble Classifiers Powerful Enough for the Detection and Diagnosis of Intermediate-Severity Faults?
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[138]  arXiv:2007.03506 (replaced) [pdf, other]
Title: Hierarchical nucleation in deep neural networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[139]  arXiv:2007.03533 (replaced) [pdf, other]
Title: A Federated F-score Based Ensemble Model for Automatic Rule Extraction
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[140]  arXiv:2007.03641 (replaced) [pdf, ps, other]
Title: One-Bit Compressed Sensing via One-Shot Hard Thresholding
Authors: Jie Shen
Comments: Accepted to The Conference on Uncertainty in Artificial Intelligence (UAI) 2020
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
[141]  arXiv:2007.03762 (replaced) [pdf, other]
Title: Transfer Learning for Electricity Price Forecasting
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
[142]  arXiv:2007.04002 (replaced) [pdf, other]
Title: Unbiased Lift-based Bidding System
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
[143]  arXiv:2007.04275 (replaced) [pdf, other]
Title: Graph Neural Networks for the Prediction of Substrate-Specific Organic Reaction Conditions
Comments: 23 pages, 10 tables, 13 figures, to appear in the ICML 2020 Workshop on Graph Representation Learning and Beyond (GRLB)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)