Statistics
New submissions
New submissions for Tue, 14 Jul 20
 [1] arXiv:2007.05554 [pdf, other]

Title: Bayesian Optimization of Risk Measures
Comments: The paper is 12 pages and includes 3 figures. The supplement is an additional 11 pages with 2 figures. The paper is currently under review for NeurIPS 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
We consider Bayesian optimization of objective functions of the form $\rho[ F(x, W) ]$, where $F$ is a black-box expensive-to-evaluate function and $\rho$ denotes either the VaR or CVaR risk measure, computed with respect to the randomness induced by the environmental random variable $W$. Such problems arise in decision making under uncertainty, such as in portfolio optimization and robust systems design. We propose a family of novel Bayesian optimization algorithms that exploit the structure of the objective function to substantially improve sampling efficiency. Instead of modeling the objective function directly as is typical in Bayesian optimization, these algorithms model $F$ as a Gaussian process, and use the implied posterior on the objective function to decide which points to evaluate. We demonstrate the effectiveness of our approach in a variety of numerical experiments.
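As a rough illustration of the objective $\rho[ F(x, W) ]$ described in this abstract, the sketch below estimates VaR and CVaR from a finite set of draws of $W$. It is not the paper's algorithm (no Gaussian process is involved); `toy_f`, the sample values, and the tail-average CVaR estimator are assumptions chosen for illustration.

```python
import math

def empirical_var(samples, alpha):
    """Smallest sample z such that at least a fraction alpha of the samples are <= z."""
    ordered = sorted(samples)
    k = math.ceil(alpha * len(ordered)) - 1
    return ordered[max(k, 0)]

def empirical_cvar(samples, alpha):
    """Mean of the samples at or above the empirical VaR (a simple tail-average estimator)."""
    var = empirical_var(samples, alpha)
    tail = [z for z in samples if z >= var]
    return sum(tail) / len(tail)

def toy_f(x, w):
    # Hypothetical stand-in for the expensive black-box function F(x, W).
    return (x - 1.0) ** 2 + w * x

# Assumed finite set of equally likely values of the environmental variable W.
w_values = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
losses = [toy_f(2.0, w) for w in w_values]
print(empirical_var(losses, 0.5), empirical_cvar(losses, 0.5))
```

Evaluating these two functions over a grid of `x` values gives a brute-force view of the risk-measure objective that a Bayesian optimization method would try to minimize with far fewer evaluations.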
 [2] arXiv:2007.05610 [pdf, other]

Title: Batch-Incremental Triplet Sampling for Training Triplet Networks Using Bayesian Updating Theorem
Comments: The first two authors contributed equally to this work
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Variants of Triplet networks are robust entities for learning a discriminative embedding subspace. There exist different triplet mining approaches for selecting the most suitable training triplets. Some of these mining methods rely on the extreme distances between instances, and some others make use of sampling. However, sampling from stochastic distributions of data rather than sampling merely from the existing embedding instances can provide more discriminative information. In this work, we sample triplets from distributions of data rather than from existing instances. We consider a multivariate normal distribution for the embedding of each class. Using Bayesian updating and conjugate priors, we update the distributions of classes dynamically by receiving the new mini-batches of training data. The proposed triplet mining with Bayesian updating can be used with any triplet-based loss function, e.g., triplet loss or Neighborhood Component Analysis (NCA) loss. Accordingly, our triplet mining approaches are called Bayesian Updating Triplet (BUT) and Bayesian Updating NCA (BUNCA), depending on which loss function is being used. Experimental results on two public datasets, namely MNIST and histopathology colorectal cancer (CRC), substantiate the effectiveness of the proposed triplet mining method.
 [3] arXiv:2007.05627 [pdf, other]

Title: A Performance Guarantee for Spectral Clustering
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The two-step spectral clustering method, which consists of the Laplacian eigenmap and a rounding step, is a widely used method for graph partitioning. It can be seen as a natural relaxation to the NP-hard minimum ratio cut problem. In this paper we study the central question: when is spectral clustering able to find the global solution to the minimum ratio cut problem? First we provide a condition that naturally depends on the intra- and inter-cluster connectivities of a given partition under which we may certify that this partition is the solution to the minimum ratio cut problem. Then we develop a deterministic two-to-infinity norm perturbation bound for the invariant subspace of the graph Laplacian that corresponds to the $k$ smallest eigenvalues. Finally by combining these two results we give a condition under which spectral clustering is guaranteed to output the global solution to the minimum ratio cut problem, which serves as a performance guarantee for spectral clustering.
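To make the minimum ratio cut objective in this abstract concrete, the following sketch evaluates the ratio cut of a partition and finds the minimizer by exhaustive search on a tiny graph. It assumes the common definition (sum over clusters of the cut weight divided by the cluster size, without a factor of 1/2) and is not the spectral method itself; the example graph and function names are illustrative.

```python
from itertools import product

def ratio_cut(adj, labels, k):
    """Sum over clusters of (weight of edges leaving the cluster) / (cluster size)."""
    n = len(adj)
    total = 0.0
    for c in range(k):
        members = [i for i in range(n) if labels[i] == c]
        cut = sum(adj[i][j] for i in members for j in range(n) if labels[j] != c)
        total += cut / len(members)
    return total

def brute_force_min_ratio_cut(adj, k):
    """Exhaustive search over labelings; exponential cost, only for tiny graphs."""
    n = len(adj)
    best = None
    for labels in product(range(k), repeat=n):
        if len(set(labels)) < k:
            continue  # skip labelings that leave a cluster empty
        value = ratio_cut(adj, labels, k)
        if best is None or value < best[0]:
            best = (value, labels)
    return best

# Two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

value, labels = brute_force_min_ratio_cut(adj, 2)
print(value, labels)
```

The exhaustive search is what makes the problem hard at scale; spectral clustering replaces it with an eigenvector computation followed by rounding.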
 [4] arXiv:2007.05670 [pdf, other]

Title: An Asymptotically Optimal Multi-Armed Bandit Algorithm and Hyperparameter Optimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
The evaluation of hyperparameters, neural architectures, or data augmentation policies becomes a critical model selection problem in advanced deep learning with a large hyperparameter search space. In this paper, we propose an efficient and robust bandit-based algorithm called Sub-Sampling (SS) in the scenario of hyperparameter search evaluation. It evaluates the potential of hyperparameters by the sub-samples of observations and is theoretically proved to be optimal under the criterion of cumulative regret. We further combine SS with Bayesian Optimization and develop a novel hyperparameter optimization algorithm called BOSS. Empirical studies validate our theoretical arguments of SS and demonstrate the superior performance of BOSS on a number of applications, including Neural Architecture Search (NAS), Data Augmentation (DA), Object Detection (OD), and Reinforcement Learning (RL).
 [5] arXiv:2007.05692 [pdf, other]

Title: How Does GAN-based Semi-supervised Learning Work?
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Generative adversarial networks (GANs) have been widely used and have achieved competitive results in semi-supervised learning. This paper theoretically analyzes how GAN-based semi-supervised learning (GAN-SSL) works. We first prove that, given a fixed generator, optimizing the discriminator of GAN-SSL is equivalent to optimizing that of supervised learning. Thus, the optimal discriminator in GAN-SSL is expected to be perfect on labeled data. Then, if the perfect discriminator can further cause the optimization objective to reach its theoretical maximum, the optimal generator will match the true data distribution. Since it is impossible to reach the theoretical maximum in practice, one cannot expect to obtain a perfect generator for generating data, which is apparently different from the objective of GANs. Furthermore, if the labeled data can traverse all connected subdomains of the data manifold, which is reasonable in semi-supervised classification, we additionally expect the optimal discriminator in GAN-SSL to also be perfect on unlabeled data. In conclusion, the minimax optimization in GAN-SSL will theoretically output a perfect discriminator on both labeled and unlabeled data by unexpectedly learning an imperfect generator, i.e., GAN-SSL can effectively improve the generalization ability of the discriminator by leveraging unlabeled information.
 [6] arXiv:2007.05709 [pdf, other]

Title: Scoring Interval Forecasts: Equal-Tailed, Shortest, and Modal Interval
Comments: 22 pages
Subjects: Statistics Theory (math.ST)
We consider different types of predictive intervals and ask whether they are elicitable, i.e. are unique minimizers of a loss or scoring function in expectation. The equal-tailed interval is elicitable, with a rich class of suitable loss functions, though subject to either translation invariance, or positive homogeneity and differentiability, the Winkler interval score becomes a unique choice. The modal interval also is elicitable, with a sole consistent scoring function, up to equivalence. However, the shortest interval fails to be elicitable relative to practically relevant classes of distributions. These results provide guidance in interval forecast evaluation and support recent choices of performance measures in forecast competitions.
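For readers unfamiliar with the Winkler interval score mentioned in this abstract, here is a minimal sketch of its standard form for an equal-tailed $(1-\alpha)$ interval $[l, u]$ and outcome $y$: the interval width plus a penalty of $2/\alpha$ times the distance by which $y$ misses the interval. The function name and argument order are assumptions for illustration; lower scores are better.

```python
def winkler_score(lower, upper, y, alpha):
    """Winkler interval score for a central (1 - alpha) prediction interval."""
    width = upper - lower
    if y < lower:
        # Outcome falls below the interval: penalize the shortfall.
        return width + (2.0 / alpha) * (lower - y)
    if y > upper:
        # Outcome falls above the interval: penalize the overshoot.
        return width + (2.0 / alpha) * (y - upper)
    return width  # outcome covered: the score is just the width

print(winkler_score(0.0, 10.0, 5.0, 0.5))
print(winkler_score(0.0, 10.0, 12.0, 0.5))
```

Averaging this score over many forecast cases rewards intervals that are narrow yet still cover the outcomes at the nominal rate.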
 [7] arXiv:2007.05721 [pdf, other]

Title: Towards Robust Classification with Deep Generative Forests
Comments: Presented at the ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Decision Trees and Random Forests are among the most widely used machine learning models, and often achieve state-of-the-art performance in tabular, domain-agnostic datasets. Nonetheless, being primarily discriminative models they lack principled methods to manipulate the uncertainty of predictions. In this paper, we exploit Generative Forests (GeFs), a recent class of deep probabilistic models that addresses these issues by extending Random Forests to generative models representing the full joint distribution over the feature space. We demonstrate that GeFs are uncertainty-aware classifiers, capable of measuring the robustness of each prediction as well as detecting out-of-distribution samples.
 [8] arXiv:2007.05724 [pdf, other]

Title: Learning Randomly Perturbed Structured Predictors for Direct Loss Minimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Direct loss minimization is a popular approach for learning predictors over structured label spaces. This approach is computationally appealing as it replaces integration with optimization and allows gradients to be propagated in a deep net using loss-perturbed prediction. Recently, this technique was extended to generative models, while introducing a randomized predictor that samples a structure from a randomly perturbed score function. In this work, we learn the variance of these randomized structured predictors and show that it balances better between the learned score function and the randomized noise in structured prediction. We demonstrate empirically the effectiveness of learning the balance between the signal and the random noise in structured discrete spaces.
 [9] arXiv:2007.05737 [pdf, ps, other]

Title: Empirical process theory for locally stationary processes
Subjects: Statistics Theory (math.ST)
We provide a framework for empirical process theory of locally stationary processes using the functional dependence measure. Our results extend known results for stationary mixing sequences by another common possibility to measure dependence and allow for additional time dependence. We develop maximal inequalities for expectations and provide functional limit theorems and Bernstein-type inequalities. We show their applicability to a variety of situations; for instance, we prove the weak functional convergence of the empirical distribution function and uniform convergence rates for kernel density and regression estimation if the observations are locally stationary processes.
 [10] arXiv:2007.05748 [pdf, ps, other]

Title: Frequentism-as-model
Authors: Christian Hennig
Comments: 34 pages no figures
Subjects: Other Statistics (stat.OT); Methodology (stat.ME)
Most statisticians are aware that probability models interpreted in a frequentist manner are not really true in objective reality, but only idealisations. I argue that this is often ignored when actually applying frequentist methods and interpreting the results, and that keeping up the awareness for the essential difference between reality and models can lead to a more appropriate use and interpretation of frequentist models and methods, called frequentism-as-model. This is elaborated showing connections to existing work, appreciating the special role of i.i.d. models and subject matter knowledge, giving an account of how and under what conditions models that are not true can be useful, giving detailed interpretations of tests and confidence intervals, confronting their implicit compatibility logic with the inverse probability logic of Bayesian inference, re-interpreting the role of model assumptions, appreciating robustness, and the role of "interpretative equivalence" of models. Epistemic (often referred to as Bayesian) probability shares the issue that its models are only idealisations and not really true for modelling reasoning about uncertainty, meaning that it does not have an essential advantage over frequentism, as is often claimed. Bayesian statistics can be combined with frequentism-as-model, leading to what Gelman and Hennig (2017) call "falsificationist Bayes".
 [11] arXiv:2007.05812 [pdf, ps, other]

Title: Exact Bayesian inference for diffusion driven Cox processes
Subjects: Methodology (stat.ME)
In this paper we present a novel methodology to perform Bayesian inference for Cox processes in which the intensity function is driven by a diffusion process. The novelty lies in the fact that no discretisation error is involved, despite the non-tractability of both the likelihood function and the transition density of the diffusion. The methodology is based on an MCMC algorithm and its exactness is built on retrospective sampling techniques. The efficiency of the methodology is investigated in some simulated examples and its applicability is illustrated in some real data analyses.
 [12] arXiv:2007.05857 [pdf, other]

Title: Reliability of decisions based on tests: Fourier analysis of Boolean decision functions
Comments: 41 pages, 4 figures
Subjects: Methodology (stat.ME); Other Statistics (stat.OT)
Items in a test are often used as a basis for making decisions and such tests are therefore required to have good psychometric properties, like unidimensionality. In many cases the sum score is used in combination with a threshold to decide between pass or fail, for instance. Here we consider whether such a decision function is appropriate, without a latent variable model, and which properties of a decision function are desirable. We consider reliability (stability) of the decision function, i.e., does the decision change upon perturbations, or changes in a fraction of the outcomes of the items (measurement error). We are concerned with questions of whether the sum score is the best way to aggregate the items, and if so why. We use ideas from test theory, social choice theory, graphical models, computer science and probability theory to answer these questions. We conclude that a weighted sum score has desirable properties that (i) fit with test theory and is observable (similar to a condition like conditional association), (ii) has the property that a decision is stable (reliable), and (iii) satisfies Rousseau's criterion that the input should match the decision. We use Fourier analysis of Boolean functions to investigate whether a decision function is stable and to figure out which (set of) items has proportionally too large an influence on the decision. To apply these techniques we invoke ideas from graphical models and use a pseudolikelihood factorisation of the probability distribution.
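The notion of an item having too large an influence on a pass/fail decision, as discussed in this abstract, can be illustrated with a small sketch. It computes, for a thresholded weighted sum score, the probability that flipping one item's outcome changes the decision. This assumes independent, uniformly distributed item outcomes, which is a simplification: the paper works with a pseudolikelihood factorisation of the item distribution instead. All names are illustrative.

```python
from itertools import product

def decision(x, weights, threshold):
    """Pass (1) if the weighted sum score reaches the threshold, else fail (0)."""
    return int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)

def influences(weights, threshold):
    """Influence of item i: fraction of outcome patterns where flipping item i flips the decision."""
    n = len(weights)
    counts = [0] * n
    for x in product((0, 1), repeat=n):
        base = decision(x, weights, threshold)
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            if decision(flipped, weights, threshold) != base:
                counts[i] += 1
    return [c / 2 ** n for c in counts]

# Equal weights: every item matters equally (majority of three items).
print(influences([1, 1, 1], 2))
# One dominant item: the decision depends on that item alone.
print(influences([3, 1, 1], 3))
```

The second case shows the kind of imbalance the abstract warns about: a single item fully determines the outcome while the others have no effect.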
 [13] arXiv:2007.05864 [pdf, other]

Title: Bayesian Deep Ensembles via the Neural Tangent Kernel
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We explore the link between deep ensembles and Gaussian processes (GPs) through the lens of the Neural Tangent Kernel (NTK): a recent development in understanding the training dynamics of wide neural networks (NNs). Previous work has shown that even in the infinite width limit, when NNs become GPs, there is no GP posterior interpretation to a deep ensemble trained with squared error loss. We introduce a simple modification to standard deep ensembles training, through addition of a computationally-tractable, randomised and untrainable function to each ensemble member, that enables a posterior interpretation in the infinite width limit. When ensembled together, our trained NNs give an approximation to a posterior predictive distribution, and we prove that our Bayesian deep ensembles make more conservative predictions than standard deep ensembles in the infinite width limit. Finally, using finite width NNs we demonstrate that our Bayesian deep ensembles faithfully emulate the analytic posterior predictive when available, and can outperform standard deep ensembles in various out-of-distribution settings, for both regression and classification tasks.
 [14] arXiv:2007.05894 [pdf, other]

Title: A Probabilistic Approach to Identifying Run Scoring Advantage in the Order of Playing Cricket
Subjects: Applications (stat.AP)
In the game of cricket, the result of the coin toss is assumed to be one of the determinants of match outcome. The decision to bat first after winning the toss is often taken to make the best use of superior pitch conditions and set a big target for the opponent. However, the opponent may fail to show their natural batting performance in the second innings due to a number of factors, including deteriorated pitch conditions and excessive pressure of chasing a high target score. The advantage of batting first has been highlighted in the literature and expert opinions; however, the effect of batting and bowling order on match outcome has not been investigated well enough to recommend a solution to any potential bias. This study proposes a probability theory-based model to study venue-specific scoring and chasing characteristics of teams under different match outcomes. A total of 1117 one-day international matches held in ten popular venues are analyzed to show substantially high scoring advantage and likelihood when the winning team bats in the first innings. Results suggest that the same 'bat-first' winning team is very unlikely to score or chase such a high score if they were to bat in the second innings. Therefore, the coin toss decision may favor one team over the other. A Bayesian model is proposed to revise the target score for each venue such that the winning and scoring likelihood is equal regardless of the toss decision. The data and source codes have been shared publicly for future research in creating competitive match outcomes by eliminating the advantage of batting order in run scoring.
 [15] arXiv:2007.05940 [pdf, other]

Title: Perfect Sampling of Multivariate Hawkes Process
Subjects: Applications (stat.AP)
As an extension of self-exciting Hawkes process, the multivariate Hawkes process models counting processes of different types of random events with mutual excitement. In this paper, we present a perfect sampling algorithm that can generate i.i.d. stationary sample paths of multivariate Hawkes process without any transient bias. In addition, we provide an explicit expression of algorithm complexity in model and algorithm parameters and provide numerical schemes to find the optimal parameter set that minimizes the complexity of the perfect sampling algorithm.
 [16] arXiv:2007.05974 [pdf, other]

Title: Point and interval estimation of the target dose using weighted regression modelling
Subjects: Methodology (stat.ME)
In a Phase II dose-finding study with a placebo control, a new drug with several dose levels is compared with a placebo to test for the effectiveness of the new drug. The main focus of such studies often lies in the characterization of the dose-response relationship followed by the estimation of a target dose that leads to a clinically relevant effect over the placebo. This target dose is known as the minimum effective dose (MED) in a drug development study. Several approaches exist that combine multiple comparison procedures with modeling techniques to efficiently estimate the dose-response model and thereafter select the target dose. Despite the flexibility of the existing approaches, they cannot completely address the model uncertainty in the model-selection step and may lead to target dose estimates that are biased. In this article, we propose two new MED estimation approaches based on weighted regression modeling that are robust against deviations from the dose-response model assumptions. These approaches are compared with existing approaches with regard to their accuracy in point and interval estimation of the MED. We illustrate by a simulation study that by integrating one of the new dose estimation approaches with the existing dose-response profile estimation approaches one can take into account the uncertainty of the model selection step.
 [17] arXiv:2007.05994 [pdf, other]

Title: State Space Expectation Propagation: Efficient Inference Schemes for Temporal Gaussian Processes
Comments: Accepted to International Conference on Machine Learning (ICML) 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We formulate approximate Bayesian inference in non-conjugate temporal and spatio-temporal Gaussian process models as a simple parameter update rule applied during Kalman smoothing. This viewpoint encompasses most inference schemes, including expectation propagation (EP), the classical (Extended, Unscented, etc.) Kalman smoothers, and variational inference. We provide a unifying perspective on these algorithms, showing how replacing the power EP moment matching step with linearisation recovers the classical smoothers. EP provides some benefits over the traditional methods via introduction of the so-called cavity distribution, and we combine these benefits with the computational efficiency of linearisation, providing extensive empirical analysis demonstrating the efficacy of various algorithms under this unifying framework. We provide a fast implementation of all methods in JAX.
 [18] arXiv:2007.06011 [pdf, other]

Title: Explaining the data or explaining a model? Shapley values that uncover nonlinear dependencies
Comments: 23 pages, 6 figures, 2 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Shapley values have become increasingly popular in the machine learning literature thanks to their attractive axiomatisation, flexibility, and uniqueness in satisfying certain notions of 'fairness'. The flexibility arises from the myriad potential forms of the Shapley value game formulation. Amongst the consequences of this flexibility is that there are now many types of Shapley values being discussed, with such variety being a source of potential misunderstanding.
To the best of our knowledge, all existing game formulations in the machine learning and statistics literature fall into a category which we name the model-dependent category of game formulations. In this work, we consider an alternative and novel formulation which leads to the first instance of what we call model-independent Shapley values. These Shapley values use a (non-parametric) measure of nonlinear dependence as the characteristic function. The strength of these Shapley values is in their ability to uncover and attribute nonlinear dependencies amongst features.
We introduce and demonstrate the use of the energy distance correlations, affine-invariant distance correlation, and Hilbert-Schmidt independence criterion as Shapley value characteristic functions. In particular, we demonstrate their potential value for exploratory data analysis and model diagnostics. We conclude with an interesting expository application to a classical medical survey data set.
 [19] arXiv:2007.06018 [pdf, other]

Title: Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation
Comments: Accepted to International Conference on Artificial Intelligence and Statistics 2020
Subjects: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
Auto-regressive sequence generative models trained by Maximum Likelihood Estimation suffer the exposure bias problem in practical finite sample scenarios. The crux is that the number of training samples for Maximum Likelihood Estimation is usually limited and the input data distributions are different at training and inference stages. Many methods have been proposed to solve the above problem (Yu et al., 2017; Lu et al., 2018), which rely on sampling from the non-stationary model distribution and suffer from high variance or biased estimations. In this paper, we propose $\psi$-MLE, a new training scheme for auto-regressive sequence generative models, which is effective and stable when operating at large sample space encountered in text generation. We derive our algorithm from a new perspective of self-augmentation and introduce bias correction with density ratio estimation. Extensive experimental results on synthetic data and real-world text generation tasks demonstrate that our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
 [20] arXiv:2007.06037 [pdf, other]

Title: Estimating Stochastic Poisson Intensities Using Deep Latent Models
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We present methodology for estimating the stochastic intensity of a doubly stochastic Poisson process. Statistical and theoretical analyses of traffic traces show that these processes are appropriate models of high intensity traffic arriving at an array of service systems. The statistical estimation of the underlying latent stochastic intensity process driving the traffic model involves a rather complicated nonlinear filtering problem. We develop a novel simulation methodology, using deep neural networks to approximate the path measures induced by the stochastic intensity process, for solving this nonlinear filtering problem. Our simulation studies demonstrate that the method is quite accurate on both in-sample estimation and on an out-of-sample performance prediction task for an infinite server queue.
 [21] arXiv:2007.06038 [pdf, other]

Title: Fiducial Matching for the Approximate Posterior: FABC
Authors: Yannis G. Yatracos
Subjects: Methodology (stat.ME)
FABC is introduced, using universal sufficient statistics, unlike previous ABC papers, e.g. Bernton et al. (2019), and avoiding in the approximate posterior artifacts due to a Kernel. The nature of matching tolerance is examined and indications for determining its values are presented. FABC does not face concerns associated with ABC. Asymptotics and simulation results are also presented.
 [22] arXiv:2007.06054 [pdf, other]

Title: Robust and flexible inference for the covariate-specific ROC curve
Authors: Vanda Inacio, Vanda M. Lourenco, Miguel de Carvalho, Richard A. Parker, Vincent Gnanapragasam
Subjects: Methodology (stat.ME)
Diagnostic tests are of critical importance in health care and medical research. Motivated by the impact that atypical and outlying test outcomes might have on the assessment of the discriminatory ability of a diagnostic test, we develop a flexible and robust model for conducting inference about the covariate-specific receiver operating characteristic (ROC) curve that safeguards against outlying test results while also accommodating possible nonlinear effects of the covariates. Specifically, we postulate a location-scale additive regression model for the test outcomes in both the diseased and nondiseased populations, combining additive cubic B-splines and M-estimation for the regression function, while the residuals are estimated via a weighted empirical distribution function. The results of the simulation study show that our approach successfully recovers the true covariate-specific ROC curve and corresponding area under the curve on a variety of conceivable test outcomes contamination scenarios. Our method is applied to a dataset derived from a prostate cancer study where we seek to assess the ability of the Prostate Health Index to discriminate between men with and without Gleason 7 or above prostate cancer, and if and how such discriminatory capacity changes with age.
 [23] arXiv:2007.06065 [pdf, other]

Title: The Effects of Vacant Lot Greening and the Impact of Land Use and Business Vibrancy
Subjects: Other Statistics (stat.OT)
We examine the ongoing Philadelphia LandCare (PLC) vacant lot greening initiative and evaluate the association between this built environment intervention and changes in crime incidence. We develop a propensity score matching analysis that estimates the effect of vacant lot greening on different types of crime while accounting for substantial differences between greened and ungreened lots in terms of their surrounding demographic, economic, land use and business vibrancy characteristics. Within these matched pairs of greened vs. ungreened vacant lots, we estimate larger and more significant beneficial effects of greening for reducing violent, nonviolent and total crime compared to comparisons of greened vs. ungreened lots without matching. We also investigate the impact of land use zoning and business vibrancy and find that the effect of vacant lot greening on total crime is substantially affected by particular types of surrounding land use zoning and the presence of certain business types.
 [24] arXiv:2007.06072 [pdf, other]

Title: A spectral algorithm for robust regression with subgaussian rates
Authors: Jules Depersin
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
We study a new linear up to quadratic time algorithm for linear regression in the absence of strong assumptions on the underlying distributions of samples, and in the presence of outliers. The goal is to design a procedure which comes with actual working code that attains the optimal sub-gaussian error bound even though the data have only finite moments (up to $L_4$) and in the presence of possibly adversarial outliers. A polynomial-time solution to this problem has been recently discovered but has high runtime due to its use of Sum-of-Square hierarchy programming. At the core of our algorithm is an adaptation of the spectral method introduced for the mean estimation problem to the linear regression problem. As a by-product we established a connection between the linear regression problem and the furthest hyperplane problem. From a stochastic point of view, in addition to the study of the classical quadratic and multiplier processes we introduce a third empirical process that comes naturally in the study of the statistical properties of the algorithm.
 [25] arXiv:2007.06075 [pdf, other]

Title: Learning latent stochastic differential equations with variational auto-encoders
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We present a method for learning latent stochastic differential equations (SDEs) from high dimensional time series data. Given a time series generated from a lower dimensional It\^{o} process, the proposed method uncovers the relevant parameters of the SDE through a self-supervised learning approach. Using the framework of variational autoencoders (VAEs), we consider a conditional generative model for the data based on the Euler-Maruyama approximation of SDE solutions. Furthermore, we use recent results on identifiability of semi-supervised learning to show that our model can recover not only the underlying SDE parameters, but also the original latent space, up to an isometry, in the limit of infinite data. We validate the model through a series of different simulated video processing tasks where the underlying SDE is known. Our results suggest that the proposed method effectively learns the underlying SDE, as predicted by the theory.
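The Euler-Maruyama approximation referred to in this abstract can be sketched in a few lines for a scalar SDE $dX_t = a(X_t)\,dt + b(X_t)\,dW_t$, using the update $X_{k+1} = X_k + a(X_k)\,\Delta t + b(X_k)\sqrt{\Delta t}\,\xi_k$ with standard normal $\xi_k$. This is only the simulation scheme, not the paper's VAE model; the Ornstein-Uhlenbeck-style drift and constant diffusion below are assumed example choices.

```python
import math
import random

def euler_maruyama(drift, diffusion, x0, dt, n_steps, rng):
    """Simulate one path of dX = drift(X) dt + diffusion(X) dW on a fixed time grid."""
    path = [x0]
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)  # standard normal increment, scaled by sqrt(dt) below
        x = x + drift(x) * dt + diffusion(x) * math.sqrt(dt) * noise
        path.append(x)
    return path

# Assumed example: mean-reverting drift -x with constant diffusion 0.5.
rng = random.Random(0)
path = euler_maruyama(lambda x: -x, lambda x: 0.5, 1.0, 0.01, 1000, rng)
print(len(path), path[-1])
```

Passing a seeded `random.Random` instance keeps the simulated path reproducible, which is convenient when comparing learned SDE parameters against a known ground truth.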
 [26] arXiv:2007.06076 [pdf, other]

Title: svReg: Structural Varying-coefficient regression to differentiate how regional brain atrophy affects motor impairment for Huntington disease severity groups
Subjects: Methodology (stat.ME)
For Huntington disease, identification of brain regions related to motor impairment can be useful for developing interventions to alleviate the motor symptom, the major symptom of the disease. However, the effects from the brain regions to motor impairment may vary for different groups of patients. Hence, our interest is not only to identify the brain regions but also to understand how their effects on motor impairment differ by patient groups. This can be cast as a model selection problem for a varying-coefficient regression. However, this is challenging when there is a pre-specified group structure among variables. We propose a novel variable selection method for a varying-coefficient regression with such structured variables. Our method is empirically shown to select relevant variables consistently. Also, our method screens irrelevant variables better than existing methods. Hence, our method leads to a model with higher sensitivity, lower false discovery rate and higher prediction accuracy than the existing methods. Finally, we found that the effects from the brain regions to motor impairment differ by disease severity of the patients. To the best of our knowledge, our study is the first to identify such interaction effects between the disease severity and brain regions, which indicates the need for customized intervention by disease severity.
 [27] arXiv:2007.06084 [pdf, other]

Title: Bayesian probabilistic models for corporate context, with an application to internal audit activities
Comments: 34 pages, 8 figures, 10 tables
Subjects: Applications (stat.AP)
In this paper we present a business case carried out in Poste Italiane, in the context of fair performance evaluations of human resources engaged in internal audit activities. In addition to the development of a Bayesian network supporting the goal of the Internal Audit unit of Poste Italiane, the work has led to the development of a methodological approach to advanced analytics in corporate context, whose usefulness goes well beyond the specific use case described here. We thus present the different stages of such analytical strategy, from feature selection, to model structure inference and model selection, as a general toolbox that allows a completely transparent and explainable process to support data-driven decisions in business environments.
 [28] arXiv:2007.06096 [pdf, other]

Title: BaCOUn: Bayesian Classifers with Out-of-Distribution Uncertainty
Comments: ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Traditional training of deep classifiers yields overconfident models that are not reliable under dataset shift. We propose a Bayesian framework to obtain reliable uncertainty estimates for deep classifiers. Our approach consists of a plug-in "generator" used to augment the data with an additional class of points that lie on the boundary of the training data, followed by Bayesian inference on top of features that are trained to distinguish these "out-of-distribution" points.
 [29] arXiv:2007.06101 [pdf, ps, other]

Title: Multiple Imputation and Synthetic Data Generation with the R package NPBayesImputeCat
Subjects: Computation (stat.CO); Applications (stat.AP)
In many contexts, missing data and disclosure control are ubiquitous and difficult issues. In particular at statistical agencies, the respondent-level data they collect from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents' privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines to i) create multiple imputations for missing data, and ii) create synthetic data for disclosure control, for multivariate categorical data, with or without structural zeros. We describe the Dirichlet Process mixtures of products of multinomials (DPMPM) models used in the package, and illustrate various uses of the package using data samples from the American Community Survey (ACS).
 [30] arXiv:2007.06114 [pdf, ps, other]

Title: Simultaneous Feature Selection and Outlier Detection with Optimality Guarantees
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Sparse estimation methods capable of tolerating outliers have been broadly investigated in the last decade. We contribute to this research considering high-dimensional regression problems contaminated by multiple mean-shift outliers which affect both the response and the design matrix. We develop a general framework for this class of problems and propose the use of mixed-integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We characterize the theoretical properties of our approach, i.e., a necessary and sufficient condition for the robustly strong oracle property, which allows the number of features to exponentially increase with the sample size; the optimal estimation of the parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and to warm-start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through numerical simulations and an application investigating the relationships between the human microbiome and childhood obesity.
 [31] arXiv:2007.06120 [pdf, other]

Title: Fisher Auto-Encoders
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
It has been conjectured that the Fisher divergence is more robust to model uncertainty than the conventional Kullback-Leibler (KL) divergence. This motivates the design of a new class of robust generative autoencoders (AE) referred to as Fisher autoencoders. Our approach is to design Fisher AEs by minimizing the Fisher divergence between the intractable joint distribution of observed data and latent variables, with that of the postulated/modeled joint distribution. In contrast to KL-based variational AEs (VAEs), the Fisher AE can exactly quantify the distance between the true and the model-based posterior distributions. Qualitative and quantitative results are provided on both MNIST and celebA datasets demonstrating the competitive performance of Fisher AEs in terms of robustness compared to other AEs such as VAEs and Wasserstein AEs.
 [32] arXiv:2007.06129 [pdf, other]

Title: The Dependent Dirichlet Process and Related Models
Subjects: Methodology (stat.ME)
Standard regression approaches assume that some finite number of the response distribution characteristics, such as location and scale, change as a (parametric or nonparametric) function of predictors. However, it is not always appropriate to assume a location/scale representation, where the error distribution has unchanging shape over the predictor space. In fact, it often happens in applied research that the distribution of responses under study changes with predictors in ways that cannot be reasonably represented by a finite dimensional functional form. This can seriously affect the answers to the scientific questions of interest, and therefore more general approaches are indeed needed. This gives rise to the study of fully nonparametric regression models. We review some of the main Bayesian approaches that have been employed to define probability models where the complete response distribution may vary flexibly with predictors. We focus on developments based on modifications of the Dirichlet process, historically termed dependent Dirichlet processes, and some of the extensions that have been proposed to tackle this general problem using nonparametric approaches.
 [33] arXiv:2007.06136 [pdf, other]

Title: Bayesian Biclustering Methods with Applications in Computational Biology
Subjects: Applications (stat.AP)
Biclustering is a useful approach in analyzing biology data when observations come from heterogeneous groups and have a large number of features. We outline a general Bayesian approach to tackling biclustering problems in high dimensions, and propose three Bayesian biclustering models on categorical data, which increase in complexity in terms of modeling the distributions of features across biclusters. Our proposed methods apply to a wide range of scenarios: from situations where data are distinguished only among a small subset of features but masked by a large amount of noise, to situations where different groups of data are identified by different sets of features, to situations where data exhibit hierarchical structures. Through simulation studies, we show that our methods outperform existing (bi)clustering methods in both identifying clusters and recovering feature distributional patterns across biclusters. We apply our methods to two genetic datasets, though the area of application of our methods is even broader. Our methods show satisfactory performance in real data analysis, and reveal cluster-level relationships.
 [34] arXiv:2007.06154 [pdf, other]

Title: A comprehensive empirical power comparison of univariate goodness-of-fit tests for the Laplace distribution
Comments: 37 pages, 1 figure, 20 tables
Subjects: Methodology (stat.ME)
In this paper, we conduct a comprehensive survey of all univariate goodness-of-fit tests that we could find in the literature for the Laplace distribution, which amounts to a total of 45 different test statistics. After eliminating duplicates and considering parameters that yield the best power for each test, we obtain a total of 38 different test statistics. An empirical power comparison study of unmatched size is then conducted using Monte Carlo simulations, with 400 alternatives spanning over 20 families of distributions, for various sample sizes and confidence levels. A discussion of the results follows, where the best tests are selected for different classes of alternatives. A similar study was conducted for the normal distribution in Romão et al. (2010), although on a smaller scale. Our work improves significantly on Puig & Stephens (2000), which was previously the best-known reference of this kind for the Laplace distribution. All test statistics and alternatives considered here are integrated within the PoweR package for the R software.
 [35] arXiv:2007.06160 [pdf, other]

Title: Nested Dirichlet Process For Population Size Estimation From Multi-list Recapture Data
Comments: 24 pages, 9 figures, submitted to Biometrics for review
Subjects: Applications (stat.AP); Methodology (stat.ME)
Heterogeneity of response patterns is important in estimating the size of a closed population from multiple recapture data when capture patterns are different over time and location. In this paper, we extend the nonparametric one-layer latent class model for multiple recapture data proposed by Manrique-Vallier (2016) to a nested latent class model with the first layer modeling individual heterogeneity and the second layer modeling location-time differences. Location-time groups with similar recording patterns are in the same top-layer latent class and individuals within each top-layer class are dependent. The nested latent class model incorporates hierarchical heterogeneity into the modeling to estimate population size from multi-list recapture data. This approach leads to more accurate population size estimation and reduced uncertainty. We apply the method to estimating casualties from the Syrian conflict.
 [36] arXiv:2007.06283 [pdf, ps, other]

Title: Functions with average smoothness: structure, algorithms, and learning
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR)
We initiate a program of average-smoothness analysis for efficiently learning real-valued functions on metric spaces. Rather than using the (global) Lipschitz constant as the regularizer, we define a local slope at each point and gauge the function complexity as the average of these values. Since the average is often much smaller than the maximum, this complexity measure can yield considerably sharper generalization bounds, assuming that these admit a refinement where the global Lipschitz constant is replaced by our average of local slopes. Our first major contribution is to obtain just such distribution-sensitive bounds. This required overcoming a number of technical challenges, perhaps the most significant of which was bounding the empirical covering numbers, which can be much worse-behaved than the ambient ones. This in turn is based on a novel Lipschitz-type extension, which is a pointwise minimizer of the local slope, and may be of independent interest. Our combinatorial results are accompanied by efficient algorithms for denoising the random sample, as well as guarantees that the extension from the sample to the whole space will continue to be, with high probability, smooth on average. Along the way we discover a surprisingly rich combinatorial and analytic structure in the function class we define.
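To make the contrast between the global Lipschitz constant and an average of local slopes concrete, here is a small Python sketch on a finite sample. It assumes, for illustration only, that the local slope at a point is the largest ratio |f(x) - f(y)| / d(x, y) over the other sample points; the paper's exact definition may differ, and the names `local_slopes`, `points` and `values` are hypothetical.

```python
def local_slopes(points, values, dist):
    """Empirical local slope at each sample point (assumes distinct points)."""
    slopes = []
    for i, (x, fx) in enumerate(zip(points, values)):
        slopes.append(max(
            abs(fx - fy) / dist(x, y)
            for j, (y, fy) in enumerate(zip(points, values))
            if j != i
        ))
    return slopes


# Hypothetical sample: a gently sloped function with one sharp jump at the end.
points = [0.0, 1.0, 2.0, 3.0, 3.1]
values = [0.0, 0.1, 0.2, 0.3, 1.3]
slopes = local_slopes(points, values, lambda a, b: abs(a - b))

lipschitz_constant = max(slopes)            # worst-case (global) smoothness
average_slope = sum(slopes) / len(slopes)   # average smoothness over the sample
```

In this toy sample the single jump dominates the Lipschitz constant, while the average of the local slopes is noticeably smaller because most points sit in the flat region.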
 [37] arXiv:2007.06298 [pdf, other]

Title: Imputation procedures in surveys using nonparametric and machine learning methods: an empirical comparison
Subjects: Methodology (stat.ME); Computation (stat.CO)
Nonparametric and machine learning methods are flexible methods for obtaining accurate predictions. Nowadays, data sets with a large number of predictors and complex structures are fairly common. In the presence of item nonresponse, nonparametric and machine learning procedures may thus provide a useful alternative to traditional imputation procedures for deriving a set of imputed values. In this paper, we conduct an extensive empirical investigation that compares a number of imputation procedures in terms of bias and efficiency in a wide variety of settings, including high-dimensional data sets. The results suggest that a number of machine learning procedures perform very well in terms of bias and efficiency.
 [38] arXiv:2007.06299 [pdf, other]

Title: Monitoring and explainability of models in production
Comments: Workshop on Challenges in Deploying and Monitoring Machine Learning Systems (ICML 2020)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The machine learning lifecycle extends beyond the deployment stage. Monitoring deployed models is crucial for continued provision of high-quality machine learning enabled services. Key areas include model performance and data monitoring, detecting outliers and data drift using statistical techniques, and providing explanations of historic predictions. We discuss the challenges to successful implementation of solutions in each of these areas with some recent examples of production-ready solutions using open source tools.
 [39] arXiv:2007.06352 [pdf, other]

Title: Quantitative Propagation of Chaos for SGD in Wide Neural Networks
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
In this paper, we investigate the limiting behavior of a continuous-time counterpart of the Stochastic Gradient Descent (SGD) algorithm applied to two-layer overparameterized neural networks, as the number of neurons (i.e., the size of the hidden layer) $N \to +\infty$. Following a probabilistic approach, we show 'propagation of chaos' for the particle system defined by this continuous-time dynamics under different scenarios, indicating that the statistical interaction between the particles asymptotically vanishes. In particular, we establish quantitative convergence with respect to $N$ of any particle to a solution of a mean-field McKean-Vlasov equation in the metric space endowed with the Wasserstein distance. In comparison to previous works on the subject, we consider settings in which the sequence of stepsizes in SGD can potentially depend on the number of neurons and the iterations. We then identify two regimes under which different mean-field limits are obtained, one of them corresponding to an implicitly regularized version of the minimization problem at hand. We perform various experiments on real datasets to validate our theoretical results, assessing the existence of these two regimes on classification problems and illustrating our convergence results.
 [40] arXiv:2007.06357 [pdf, other]

Title: Feasible Inference for Stochastic Volatility in Brownian Semistationary Processes
Comments: 21 pages, 7 figures
Subjects: Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)
This article studies the finite sample behaviour of a number of estimators for the integrated power volatility process of a Brownian semistationary process in the non-semimartingale setting. We establish three consistent feasible estimators for the integrated volatility, two derived from parametric methods and one nonparametrically. We then use a simulation study to compare the convergence properties of the estimators to one another, and to a benchmark of an infeasible estimator. We further establish bounds for the asymptotic variance of the infeasible estimator and assess whether a central limit theorem which holds for the infeasible estimator can be translated into a feasible limit theorem for the nonparametric estimator.
 [41] arXiv:2007.06363 [pdf, other]

Title: Orthogonally Decoupled Variational Fourier Features
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Sparse inducing points have long been a standard method to fit Gaussian processes to big data. In the last few years, spectral methods that exploit approximations of the covariance kernel have been shown to be competitive. In this work we exploit a recently introduced orthogonally decoupled variational basis to combine spectral methods and sparse inducing points methods. We show that the method is competitive with the state-of-the-art on synthetic and on real-world data.
 [42] arXiv:2007.06380 [pdf, other]

Title: Synthetic Aperture Radar Image Formation with Uncertainty Quantification
Subjects: Applications (stat.AP); Image and Video Processing (eess.IV)
Synthetic aperture radar (SAR) is a day-or-night, any-weather imaging modality that is an important tool in remote sensing. Most existing SAR image formation methods result in a maximum a posteriori image which approximates the reflectivity of an unknown ground scene. This single image provides no quantification of the certainty with which the features in the estimate should be trusted. In addition, finding the mode is generally not the best way to interrogate a posterior. This paper addresses these issues by introducing a sampling framework to SAR image formation. A hierarchical Bayesian model is constructed using conjugate priors that directly incorporate coherent imaging and the problematic speckle phenomenon which is known to degrade image quality. Samples of the resulting posterior as well as parameters governing speckle and noise are obtained using a Gibbs sampler. These samples may then be used to compute estimates, and also to derive other statistics like variance which aid in uncertainty quantification. The latter information is particularly important in SAR, where ground truth images even for synthetically-created examples are typically unknown. An example result using real-world data shows that the sampling-based approach introduced here to SAR image formation provides parameter-free estimates with improved contrast and significantly reduced speckle, as well as unprecedented uncertainty quantification information.
 [43] arXiv:2007.06382 [pdf, ps, other]

Title: A class of ie-merging functions
Comments: 9 pages
Subjects: Statistics Theory (math.ST)
We describe a general class of ie-merging functions and pose the problem of finding ie-merging functions outside this class.
 [44] arXiv:2007.06388 [pdf, ps, other]

Title: Adaptive minimax testing for circular convolution
Subjects: Statistics Theory (math.ST)
Given observations from a circular random variable contaminated by an additive measurement error, we consider the problem of minimax optimal goodness-of-fit testing in a non-asymptotic framework. We propose direct and indirect testing procedures using a projection approach. The structure of the optimal tests depends on regularity and ill-posedness parameters of the model, which are unknown in practice. Therefore, adaptive testing strategies that perform optimally over a wide range of regularity and ill-posedness classes simultaneously are investigated. Considering a multiple testing procedure, we obtain adaptive, i.e., assumption-free, procedures and analyse their performance. Compared with the non-adaptive tests, their radii of testing face a deterioration by a log-factor. We show that for testing of uniformity this loss is unavoidable by providing a lower bound. The results are illustrated considering Sobolev spaces and ordinary or super smooth error densities.
 [45] arXiv:2007.06408 [pdf, ps, other]

Title: Strong Uniform Consistency with Rates for Kernel Density Estimators with General Kernels on Manifolds
Comments: 44 pages
Subjects: Statistics Theory (math.ST); Probability (math.PR); Machine Learning (stat.ML)
We provide a strong uniform consistency result, with a convergence rate, for kernel density estimation on Riemannian manifolds with Riemann integrable kernels (in the ambient Euclidean space). We also provide a strong uniform consistency result for kernel density estimation on Riemannian manifolds with Lebesgue integrable kernels. The kernels considered in this paper are different from the kernels in the Vapnik-Chervonenkis class that are frequently considered in the statistics community. We illustrate the difference when we apply them to estimate a probability density function. We also provide a necessary and sufficient condition for a kernel to be Riemann integrable on a submanifold in the Euclidean space.
 [46] arXiv:2007.06461 [pdf, ps, other]

Title: Minimum Relative Entropy Inference for Normal and Monte Carlo Distributions
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We represent affine submanifolds of exponential family distributions as minimum relative entropy submanifolds. With such representation we derive analytical formulas for the inference from partial information on expectations and covariances of multivariate normal distributions; and we improve the numerical implementation via Monte Carlo simulations for the inference from partial information of generalized expectation type.
 [47] arXiv:2007.06476 [pdf, other]

Title: A Latent Mixture Model for Heterogeneous Causal Mechanisms in Mendelian Randomization
Comments: 38 pages, 9 figures, 2 tables
Subjects: Applications (stat.AP); Methodology (stat.ME)
Mendelian Randomization (MR) is a popular method in epidemiology and genetics that uses genetic variation as instrumental variables for causal inference. Existing MR methods usually assume most genetic variants are valid instrumental variables that identify a common causal effect. There is a general lack of awareness that this effect homogeneity assumption can be violated when there are multiple causal pathways involved, even if all the instrumental variables are valid. In this article, we introduce a latent mixture model MR-PATH that groups instruments that yield similar causal effect estimates together. We develop a Monte-Carlo EM algorithm to fit this mixture model, derive approximate confidence intervals for uncertainty quantification, and adopt a modified Bayesian Information Criterion (BIC) for model selection. We verify the efficacy of the Monte-Carlo EM algorithm, confidence intervals, and model selection criterion using numerical simulations. We identify potential mechanistic heterogeneity when applying our method to estimate the effect of high-density lipoprotein cholesterol on coronary heart disease and the effect of adiposity on type II diabetes.
 [48] arXiv:2007.06482 [pdf, other]

Title: Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting. Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ and cast it into a constrained extended LQR problem, where an additional control variable implicitly selects the system dynamics within a confidence interval. We then move to the corresponding Lagrangian formulation for which we prove strong duality. As a result, we show that an $\epsilon$-optimistic controller can be computed efficiently by solving at most $O\big(\log(1/\epsilon)\big)$ Riccati equations. Finally, we prove that relaxing the original OFU problem does not impact the learning performance, thus recovering the $\tilde{O}(\sqrt{T})$ regret of OFU-LQ. To the best of our knowledge, this is the first computationally efficient confidence-based algorithm for LQR with worst-case optimal regret guarantees.
 [49] arXiv:2007.06541 [pdf, ps, other]

Title: Bayesian Modeling of COVID-19 Positivity Rate -- the Indiana experience
Comments: 13 pages, 7 figures and 2 tables. The numerical results provided were obtained via an updatable R Markdown document
Subjects: Methodology (stat.ME); Populations and Evolution (q-bio.PE)
In this short technical report we model, within the Bayesian framework, the rate of positive tests reported by the State of Indiana, accounting also for the substantial variability (and overdispersion) in the daily count of the tests performed. The approach we take results in a simple procedure for prediction, a posteriori, of this rate of 'positivity' and allows for easy and straightforward adaptation by any agency tracking daily results of COVID-19 tests. The numerical results provided herein were obtained via an updatable R Markdown document.
 [50] arXiv:2007.06543 [pdf, ps, other]

Title: Dynamics of ternary statistical experiments with equilibrium state
Comments: 7 pages, 2 figures
Journal-ref: Journal of Computational & Applied Mathematics, Kiev, 2015, No.2 (119), 37
Subjects: Other Statistics (stat.OT)
We study the scenarios of the dynamics of ternary statistical experiments, modeled using difference equations. The important features are a balance condition and the existence of a steady state (equilibrium). We give a classification of the scenarios of the model evolution, which differ significantly from one another depending on the domain of the values of the model's basic parameters.
 [51] arXiv:2007.06552 [pdf, ps, other]

Title: Relaxing the I.I.D. Assumption: Adaptive Minimax Optimal Sequential Prediction with Expert Advice
Comments: 60 pages. Blair Bilodeau and Jeffrey Negrea are equal-contribution authors; order was determined randomly
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We consider sequential prediction with expert advice when the data are generated stochastically, but the distributions generating the data may vary arbitrarily among some constraint set. We quantify relaxations of the classical I.I.D. assumption in terms of possible constraint sets, with I.I.D. at one extreme, and an adversarial mechanism at the other. The Hedge algorithm, long known to be minimax optimal in the adversarial regime, has recently been shown to also be minimax optimal in the I.I.D. setting. We show that Hedge is suboptimal between these extremes, and present a new algorithm that is adaptively minimax optimal with respect to our relaxations of the I.I.D. assumption, without knowledge of which setting prevails.
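For readers unfamiliar with the Hedge algorithm referred to in this abstract, the following Python sketch shows the standard exponential-weights form with a fixed learning rate. It is not the new adaptive algorithm proposed in the paper; the function name `hedge`, the random loss sequence and the tuning of `eta` are assumptions made for illustration.

```python
import math
import random


def hedge(loss_rounds, eta):
    """Run Hedge on a sequence of loss vectors with entries in [0, 1].

    Returns the learner's cumulative expected loss and each expert's
    cumulative loss.
    """
    n_experts = len(loss_rounds[0])
    cumulative = [0.0] * n_experts
    learner_loss = 0.0
    for losses in loss_rounds:
        # Weights proportional to exp(-eta * cumulative loss); subtracting
        # the minimum keeps the exponentials numerically stable.
        base = min(cumulative)
        weights = [math.exp(-eta * (c - base)) for c in cumulative]
        total = sum(weights)
        probs = [w / total for w in weights]
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return learner_loss, cumulative


# Hypothetical data: 4 experts, 200 rounds of random losses in [0, 1].
rng = random.Random(0)
n_experts, horizon = 4, 200
rounds = [[rng.random() for _ in range(n_experts)] for _ in range(horizon)]
eta = math.sqrt(8.0 * math.log(n_experts) / horizon)  # standard worst-case tuning
learner_loss, expert_losses = hedge(rounds, eta)
regret = learner_loss - min(expert_losses)
```

With this tuning, the classical worst-case analysis bounds the regret by sqrt(T ln(N) / 2) for any loss sequence in [0, 1], where T is the number of rounds and N the number of experts.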
 [52] arXiv:2007.06558 [pdf, ps, other]

Title: Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning. This class of methods is often applied in conjunction with entropy regularization (an algorithmic scheme that helps encourage exploration) and is closely related to soft policy iteration and trust region policy optimization. Despite the empirical success, the theoretical underpinnings for NPG methods remain severely limited even for the tabular setting. This paper develops non-asymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly (or even quadratically once it enters a local region around the optimal policy) when computing optimal value functions of the regularized MDP. Moreover, the algorithm is provably stable vis-à-vis inexactness of policy evaluation, and is able to find an $\epsilon$-optimal policy for the original MDP when applied to a slightly perturbed MDP. Our convergence results outperform the ones established for unregularized NPG methods (arXiv:1908.00261), and shed light upon the role of entropy regularization in accelerating convergence.
Cross-lists for Tue, 14 Jul 20
 [53] arXiv:1801.00718 (cross-list from cs.CE) [pdf, other]

Title: Selective review of offline change point detection methods
Journal-ref: Signal Processing, 167:107299, 2020
Subjects: Computational Engineering, Finance, and Science (cs.CE); Computation (stat.CO); Methodology (stat.ME)
This article presents a selective survey of algorithms for the offline detection of multiple change points in multivariate time series. A general yet structuring methodological strategy is adopted to organize this vast body of work. More precisely, detection algorithms considered in this review are characterized by three elements: a cost function, a search method and a constraint on the number of changes. Each of those elements is described, reviewed and discussed separately. Implementations of the main algorithms described in this article are provided within a Python package called ruptures.
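The three elements named in this abstract (cost function, search method, constraint on the number of changes) can be illustrated with a short, self-contained Python sketch. It does not use the ruptures API; the names `segment_cost` and `detect_changes` are hypothetical. The assumed choices are a squared-error cost around the segment mean, an exhaustive dynamic-programming search, and a fixed, known number of change points in a univariate signal.

```python
def segment_cost(prefix, prefix_sq, a, b):
    """Cost function: squared deviation from the mean on signal[a:b]."""
    s = prefix[b] - prefix[a]
    sq = prefix_sq[b] - prefix_sq[a]
    return sq - s * s / (b - a)


def detect_changes(signal, n_changes):
    """Search method: dynamic programming over all segmentations.

    Constraint: exactly n_changes change points (needs len(signal) > n_changes).
    Returns the sorted indices at which a new segment starts.
    """
    n = len(signal)
    prefix, prefix_sq = [0.0], [0.0]
    for v in signal:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    inf = float("inf")
    # best[k][t]: minimal cost of splitting signal[:t] into k + 1 segments.
    best = [[inf] * (n + 1) for _ in range(n_changes + 1)]
    arg = [[0] * (n + 1) for _ in range(n_changes + 1)]
    for t in range(1, n + 1):
        best[0][t] = segment_cost(prefix, prefix_sq, 0, t)
    for k in range(1, n_changes + 1):
        for t in range(k + 1, n + 1):
            for s in range(k, t):  # s is the start of the last segment
                cost = best[k - 1][s] + segment_cost(prefix, prefix_sq, s, t)
                if cost < best[k][t]:
                    best[k][t] = cost
                    arg[k][t] = s

    # Backtrack from the full signal to recover the change points.
    changes, t = [], n
    for k in range(n_changes, 0, -1):
        t = arg[k][t]
        changes.append(t)
    return sorted(changes)


# Hypothetical piecewise-constant signal with mean shifts at indices 5 and 10.
signal = [0.0] * 5 + [5.0] * 5 + [1.0] * 5
changes = detect_changes(signal, n_changes=2)
```

This exhaustive search is cubic in the signal length for a fixed number of changes, so it is only meant to show how the three elements fit together; swapping the cost function or replacing the fixed count with a penalty changes one element while leaving the others untouched.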
 [54] arXiv:2007.05535 (cross-list from astro-ph.CO) [pdf, other]

Title: Flow-Based Likelihoods for Non-Gaussian Inference
Comments: 14 pages, 6 figures + appendices
Subjects: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
We investigate the use of data-driven likelihoods to bypass a key assumption made in many scientific analyses, which is that the true likelihood of the data is Gaussian. In particular, we suggest using the optimization targets of flow-based generative models, a class of models that can capture complex distributions by transforming a simple base distribution through layers of nonlinearities. We call these flow-based likelihoods (FBL). We analyze the accuracy and precision of the reconstructed likelihoods on mock Gaussian data, and show that simply gauging the quality of samples drawn from the trained model is not a sufficient indicator that the true likelihood has been learned. We nevertheless demonstrate that the likelihood can be reconstructed to a precision equal to that of sampling error due to a finite sample size. We then apply FBLs to mock weak lensing convergence power spectra, a cosmological observable that is significantly non-Gaussian (NG). We find that the FBL captures the NG signatures in the data extremely well, while other commonly-used data-driven likelihoods, such as Gaussian mixture models and independent component analysis, fail to do so. This suggests that works that have found small posterior shifts in NG data with data-driven likelihoods such as these could be underestimating the impact of non-Gaussianity in parameter constraints. By introducing a suite of tests that can capture different levels of NG in the data, we show that the success or failure of traditional data-driven likelihoods can be tied back to the structure of the NG in the data. Unlike other methods, the flexibility of the FBL makes it successful at tackling different types of NG simultaneously. Because of this, and consequently their likely applicability across datasets and domains, we encourage their use for inference when sufficient mock data are available for training.
 [55] arXiv:2007.05542 (cross-list from q-bio.PE) [pdf, other]

Title: Climate & BCG: Effects on COVID-19 Death Growth Rates
Comments: 17 pages, 10 figures, 6 tables
Subjects: Populations and Evolution (q-bio.PE); Applications (stat.AP)
Multiple studies have suggested the spread of COVID-19 is affected by factors such as climate, BCG vaccinations, pollution and blood type. We perform a joint study of these factors using the death growth rates of 40 regions worldwide with both machine learning and Bayesian methods. We find weak, non-significant (< 3$\sigma$) evidence for temperature and relative humidity as factors in the spread of COVID-19 but little or no evidence for BCG vaccination prevalence or $\text{PM}_{2.5}$ pollution. The only variable detected at a statistically significant level (>3$\sigma$) is the rate of positive COVID-19 tests, with higher positive rates correlating with higher daily growth of deaths.
 [56] arXiv:2007.05549 (cross-list from cs.LG) [pdf, other]

Title: Meta-Learning Requires Meta-Augmentation
Comments: 14 pages, 8 figures. Code at this https URL
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Meta-learning algorithms aim to learn two components: a model that predicts targets for a task, and a base learner that quickly updates that model when given examples from a new task. This additional level of learning can be powerful, but it also creates another potential source for overfitting, since we can now overfit in either the model or the base learner. We describe both of these forms of meta-learning overfitting, and demonstrate that they appear experimentally in common meta-learning benchmarks. We then use an information-theoretic framework to discuss meta-augmentation, a way to add randomness that discourages the base learner and model from learning trivial solutions that do not generalize to new tasks. We demonstrate that meta-augmentation produces large complementary benefits to recently proposed meta-regularization techniques.
 [57] arXiv:2007.05553 (cross-list from cs.CR) [pdf, other]

Title: Differentially private cross-silo federated learning
Comments: 14 pages, 5 figures
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Strict privacy is of paramount importance in distributed machine learning. Federated learning, with the main idea of communicating only what is needed for learning, has been recently introduced as a general approach for distributed learning to enhance learning and improve security. However, federated learning by itself does not guarantee any privacy for data subjects. To quantify and control how much privacy is compromised in the worst case, we can use differential privacy.
In this paper we combine additively homomorphic secure summation protocols with differential privacy in the so-called cross-silo federated learning setting. The goal is to learn complex models like neural networks while guaranteeing strict privacy for the individual data subjects. We demonstrate that our proposed solutions give prediction accuracy that is comparable to the non-distributed setting, and are fast enough to enable learning models with millions of parameters in a reasonable time.
To enable learning under strict privacy guarantees that need privacy amplification by subsampling, we present a general algorithm for oblivious distributed subsampling. However, we also argue that when malicious parties are present, a simple approach using distributed Poisson subsampling gives better privacy.
Finally, we show that by leveraging random projections we can further scale up our approach to larger models while suffering only a modest performance loss.
 [58] arXiv:2007.05557 (cross-list from cs.LG) [pdf, other]

Title: Learning Entangled Single-Sample Gaussians in the Subset-of-Signals Model
Comments: Appear in COLT'2020. Updates: corrected comments on existing works; added comparison to median estimator
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
In the setting of entangled single-sample distributions, the goal is to estimate some common parameter shared by a family of $n$ distributions, given one single sample from each distribution. This paper studies mean estimation for entangled single-sample Gaussians that have a common mean but different unknown variances. We propose the subset-of-signals model where an unknown subset of $m$ variances are bounded by 1 while there are no assumptions on the other variances. In this model, we analyze a simple and natural method based on iteratively averaging the truncated samples, and show that the method achieves error $O \left(\frac{\sqrt{n\ln n}}{m}\right)$ with high probability when $m=\Omega(\sqrt{n\ln n})$, matching existing bounds for this range of $m$. We further prove lower bounds, showing that the error is $\Omega\left(\left(\frac{n}{m^4}\right)^{1/2}\right)$ when $m$ is between $\Omega(\ln n)$ and $O(n^{1/4})$, and the error is $\Omega\left(\left(\frac{n}{m^4}\right)^{1/6}\right)$ when $m$ is between $\Omega(n^{1/4})$ and $O(n^{1 - \epsilon})$ for an arbitrarily small $\epsilon>0$, improving existing lower bounds and extending to a wider range of $m$.
 [59] arXiv:2007.05558 (cross-list from cs.LG) [pdf, other]

Title: The Computational Limits of Deep Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Deep learning's recent history has been one of achievement: from triumphing over humans in the game of Go to world-leading performance in image recognition, voice recognition, translation, and other tasks. But this progress has come with a voracious appetite for computing power. This article reports on the computational demands of Deep Learning applications in five prominent application areas and shows that progress in all five is strongly reliant on increases in computing power. Extrapolating forward this reliance reveals that progress along current lines is rapidly becoming economically, technically, and environmentally unsustainable. Thus, continued progress in these applications will require dramatically more computationally efficient methods, which will either have to come from changes to deep learning or from moving to other machine learning methods.
 [60] arXiv:2007.05565 (cross-list from cs.LG) [pdf, other]

Title: Reverse Annealing for Nonnegative/Binary Matrix Factorization
Comments: 9 pages, 5 figures
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Quantum Physics (quant-ph); Machine Learning (stat.ML)
It was recently shown that quantum annealing can be used as an effective, fast subroutine in certain types of matrix factorization algorithms. The quantum annealing algorithm performed best for quick, approximate answers, but performance rapidly plateaued. In this paper, we utilize reverse annealing instead of forward annealing in the quantum annealing subroutine for nonnegative/binary matrix factorization problems. After an initial global search with forward annealing, reverse annealing performs a series of local searches that refine existing solutions. The combination of forward and reverse annealing significantly improves performance compared to forward annealing alone for all but the shortest run times.
 [61] arXiv:2007.05566 (cross-list from cs.LG) [pdf, other]

Title: Contrastive Training for Improved Out-of-Distribution Detection
Authors: Jim Winkens, Rudy Bunel, Abhijit Guha Roy, Robert Stanforth, Vivek Natarajan, Joseph R. Ledsam, Patricia MacWilliams, Pushmeet Kohli, Alan Karthikesalingam, Simon Kohl, Taylan Cemgil, S. M. Ali Eslami, Olaf Ronneberger
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Reliable detection of out-of-distribution (OOD) inputs is increasingly understood to be a precondition for deployment of machine learning systems. This paper proposes and investigates the use of contrastive training to boost OOD detection performance. Unlike leading methods for OOD detection, our approach does not require access to examples labeled explicitly as OOD, which can be difficult to collect in practice. We show in extensive experiments that contrastive training significantly helps OOD detection performance on a number of common benchmarks. By introducing and employing the Confusion Log Probability (CLP) score, which quantifies the difficulty of the OOD detection task by capturing the similarity of inlier and outlier datasets, we show that our method especially improves performance in the `near OOD' classes, a particularly challenging setting for previous methods.
 [62] arXiv:2007.05572 (cross-list from cs.LG) [pdf, other]

Title: Variable Skipping for Autoregressive Range Density Estimation
Comments: ICML 2020. Code released at: this https URL
Subjects: Machine Learning (cs.LG); Databases (cs.DB); Machine Learning (stat.ML)
Deep autoregressive models compute point likelihood estimates of individual data points. However, many applications (e.g., database cardinality estimation) require estimating range densities, a capability that is underexplored by current neural density estimation literature. In these applications, fast and accurate range density estimates over high-dimensional data directly impact user-perceived performance. In this paper, we explore a technique, variable skipping, for accelerating range density estimation over deep autoregressive models. This technique exploits the sparse structure of range density queries to avoid sampling unnecessary variables during approximate inference. We show that variable skipping provides 10-100$\times$ efficiency improvements when targeting challenging high-quantile error metrics, enables complex applications such as text pattern matching, and can be realized via a simple data augmentation procedure without changing the usual maximum likelihood objective.
 [63] arXiv:2007.05577 (cross-list from cs.LG) [pdf, other]

Title: Vizarel: A System to Help Better Understand RL Agents
Comments: Accepted to ICML 2020 Workshop on Human Interpretability in Machine Learning (Spotlight)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Visualization tools for supervised learning have allowed users to interpret, introspect, and gain intuition for the successes and failures of their models. While reinforcement learning practitioners ask many of the same questions, existing tools are not applicable to the RL setting. In this work, we describe our initial attempt at constructing a prototype of these ideas, through identifying possible features that such a system should encapsulate. Our design is motivated by envisioning the system to be a platform on which to experiment with interpretable reinforcement learning.
 [64] arXiv:2007.05611 (cross-list from cs.LG) [pdf, other]

Title: Deep Contextual Clinical Prediction with Reverse Distillation
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Healthcare providers are increasingly using learned methods to predict and understand long-term patient outcomes in order to make meaningful interventions. However, despite innovations in this area, deep learning models often struggle to match performance of shallow linear models in predicting these outcomes, making it difficult to leverage such techniques in practice. In this work, motivated by the task of clinical prediction from insurance claims, we present a new technique called reverse distillation which pretrains deep models by using high-performing linear models for initialization. We make use of the longitudinal structure of insurance claims datasets to develop Self Attention with Reverse Distillation, or SARD, an architecture that utilizes a combination of contextual embedding, temporal embedding and self-attention mechanisms and most critically is trained via reverse distillation. SARD outperforms state-of-the-art methods on multiple clinical prediction outcomes, with ablation studies revealing that reverse distillation is a primary driver of these improvements.
 [65] arXiv:2007.05646 (cross-list from cs.LG) [pdf, other]

Title: Transformations between deep neural networks
Comments: 14 pages, 10 figures, submitted to Neural Information Processing Systems 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose to test, and when possible establish, an equivalence between two different artificial neural networks by attempting to construct a data-driven transformation between them, using manifold-learning techniques. In particular, we employ diffusion maps with a Mahalanobis-like metric. If the construction succeeds, the two networks can be thought of as belonging to the same equivalence class.
We first discuss transformation functions between only the outputs of the two networks; we then also consider transformations that take into account outputs (activations) of a number of internal neurons from each network. In general, Whitney's theorem dictates the number of measurements from one of the networks required to reconstruct each and every feature of the second network. The construction of the transformation function relies on a consistent, intrinsic representation of the network input space.
We illustrate our algorithm by matching neural network pairs trained to learn (a) observations of scalar functions; (b) observations of two-dimensional vector fields; and (c) representations of images of a moving three-dimensional object (a rotating horse). The construction of such equivalence classes across different network instantiations clearly relates to transfer learning. We also expect that it will be valuable in establishing equivalence between different Machine Learning-based models of the same phenomenon observed through different instruments and by different research groups.
 [66] arXiv:2007.05665 (cross-list from cs.LG) [pdf, ps, other]

Title: A Computational Separation between Private Learning and Online Learning
Authors: Mark Bun
Comments: 15 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
A recent line of work has shown a qualitative equivalence between differentially private PAC learning and online learning: A concept class is privately learnable if and only if it is online learnable with a finite mistake bound. However, both directions of this equivalence incur significant losses in both sample and computational efficiency. Studying a special case of this connection, Gonen, Hazan, and Moran (NeurIPS 2019) showed that uniform or highly sample-efficient pure-private learners can be time-efficiently compiled into online learners. We show that, assuming the existence of one-way functions, such an efficient conversion is impossible even for general pure-private learners with polynomial sample complexity. This resolves a question of Neel, Roth, and Wu (FOCS 2019).
 [67] arXiv:2007.05675 (cross-list from cs.CV) [pdf, other]

Title: Coarse-to-Fine Pseudo-Labeling Guided Meta-Learning for Few-Shot Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
To endow neural networks with the potential to learn rapidly from a handful of samples, meta-learning blazes a trail to acquire across-task knowledge from a variety of few-shot learning tasks. However, most existing meta-learning algorithms retain the requirement of fine-grained supervision, which is expensive in many applications. In this paper, we show that meta-learning models can extract transferable knowledge from coarse-grained supervision for few-shot classification. We propose a weakly-supervised framework, namely Coarse-to-fine Pseudo-labeling Guided Meta-Learning (CPGML), to alleviate the need for data annotation. In our framework, the coarse categories are grouped into pseudo sub-categories to construct a task distribution for meta-training, following the cosine distance between the corresponding embedding vectors of images. For better feature representation in this process, we develop Dual-level Discriminative Embedding (DDE) aiming to keep the distance between learned embeddings consistent with the visual similarity and semantic relation of input images simultaneously. Moreover, we propose a task-attention mechanism to reduce the weight of the training tasks with potentially higher label noises based on the observation of task-nonequivalence. Extensive experiments conducted on two hierarchical meta-learning benchmarks demonstrate that, under the proposed framework, meta-learning models can effectively extract task-independent knowledge from the roughly-generated tasks and generalize well to unseen tasks.
 [68] arXiv:2007.05683 (cross-list from cs.LG) [pdf, other]

Title: Batch-level Experience Replay with Review for Continual Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Continual learning is a branch of deep learning that seeks to strike a balance between learning stability and plasticity. The CVPR 2020 CLVision Continual Learning for Computer Vision challenge is dedicated to evaluating and advancing the current state-of-the-art continual learning methods using the CORe50 dataset with three different continual learning scenarios. This paper presents our approach, called Batch-level Experience Replay with Review, to this challenge. Our team achieved 1st place in all three scenarios out of 79 participating teams. The codebase of our implementation is publicly available at https://github.com/RaptorMai/CVPR20_CLVision_challenge
 [69] arXiv:2007.05690 (cross-list from cs.LG) [pdf, other]

Title: Federated Learning's Blessing: FedAvg has Linear Speedup
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Federated learning (FL) learns a model jointly from a set of participating devices without sharing each other's privately held data. The characteristics of non-iid data across the network, low device participation, and the mandate that data remain private bring challenges in understanding the convergence of FL algorithms, particularly with regard to how convergence scales with the number of participating devices. In this paper, we focus on Federated Averaging (FedAvg), the most widely used and effective FL algorithm in use today, and provide a comprehensive study of its convergence rate. Although FedAvg has recently been studied by an emerging line of literature, it remains open as to how FedAvg's convergence scales with the number of participating devices in the FL setting, a crucial question whose answer would shed light on the performance of FedAvg in large FL systems. We fill this gap by establishing convergence guarantees for FedAvg under three classes of problems: strongly convex smooth, convex smooth, and overparameterized strongly convex smooth problems. We show that FedAvg enjoys linear speedup in each case, although with different convergence rates. For each class, we also characterize the corresponding convergence rates for the Nesterov accelerated FedAvg algorithm in the FL setting: to the best of our knowledge, these are the first linear speedup guarantees for FedAvg when Nesterov acceleration is used. To accelerate FedAvg, we also design a new momentum-based FL algorithm that further improves the convergence rate in overparameterized linear regression problems. Empirical studies of the algorithms in various settings have supported our theoretical results.
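For readers unfamiliar with Federated Averaging, the sketch below shows its basic loop on a toy one-parameter least-squares problem: each device runs a few local gradient steps from the current global model, and the server averages the resulting local models weighted by sample count. The setup is an assumption made for illustration (full device participation, full-batch local steps, synthetic data, hypothetical function names); it does not reproduce the paper's settings or its Nesterov or momentum variants.

```python
import random


def local_gradient_descent(w, data, lr, steps):
    """Run a few full-batch gradient steps on one device's squared loss."""
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def fedavg(device_data, rounds=50, lr=0.1, local_steps=5, w0=0.0):
    """Toy FedAvg: local training on every device, then weighted averaging."""
    w = w0
    total = sum(len(d) for d in device_data)
    for _ in range(rounds):
        # Each device starts from the current global model; raw data stays local.
        local_models = [local_gradient_descent(w, d, lr, local_steps)
                        for d in device_data]
        # The server only sees the local models and averages them.
        w = sum(len(d) * w_k for d, w_k in zip(device_data, local_models)) / total
    return w


# Synthetic data (assumed for the example): y = 3 * x + small noise, 4 devices.
rng = random.Random(0)
true_w = 3.0
device_data = []
for _ in range(4):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(20)]
    device_data.append([(x, true_w * x + rng.gauss(0.0, 0.01)) for x in xs])

w_hat = fedavg(device_data)   # should land close to true_w
```

Partial participation can be mimicked by averaging over a random subset of devices in each round, which is the regime where the scaling with the number of devices becomes the interesting question.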
 [70] arXiv:2007.05694 (cross-list from cs.LG) [pdf, other]

Title: Long-Term Planning with Deep Reinforcement Learning on Autonomous Drones
Authors: Ugurkan Ates
Comments: Submitted to Association for the Advancement of Artificial Intelligence (AAAI) 2020 Fall Symposium Series
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
In this paper, we study a long-term planning scenario that is based on drone racing competitions held in real life. We conducted this experiment on a framework created for "Game of Drones: Drone Racing Competition" at NeurIPS 2019. The racing environment was created using Microsoft's AirSim Drone Racing Lab. A reinforcement learning agent, a simulated quadrotor in our case, trained with the Proximal Policy Optimization (PPO) algorithm was able to successfully compete against another simulated quadrotor that was running a classical path planning algorithm. Agent observations consist of data from IMU sensors, GPS coordinates of the drone obtained through simulation, and opponent drone GPS information. Using opponent drone GPS information during training helps deal with complex state spaces and, by serving as expert guidance, allows for an efficient and stable training process. All experiments performed in this paper can be found and reproduced with code at our GitHub repository.
 [71] arXiv:2007.05700 (cross-list from cs.LG) [pdf, other]

Title: M-Evolve: Structural-Mapping-Based Data Augmentation for Graph Classification
Comments: 11 pages, 9 figures
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Graph classification, which aims to identify the category labels of graphs, plays a significant role in drug classification, toxicity detection, protein analysis etc. However, the limitation of scale in the benchmark datasets makes it easy for graph classification models to fall into over-fitting and undergeneralization. To improve this, we introduce data augmentation on graphs (i.e. graph augmentation) and present four methods: random mapping, vertex-similarity mapping, motif-random mapping and motif-similarity mapping, to generate more weakly labeled data for small-scale benchmark datasets via heuristic transformation of graph structures. Furthermore, we propose a generic model evolution framework, named M-Evolve, which combines graph augmentation, data filtration and model retraining to optimize pre-trained graph classifiers. Experiments on six benchmark datasets demonstrate that the proposed framework helps existing graph classification models alleviate over-fitting and undergeneralization in the training on small-scale benchmark datasets, which successfully yields an average improvement of 3-13% accuracy on graph classification tasks.
 [72] arXiv:2007.05732 (cross-list from cs.LG) [pdf, other]

Title: Online Parameter-Free Learning of Multiple Low Variance Tasks
Journal-ref: Conference on Uncertainty in Artificial Intelligence (UAI) 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose a method to learn a common bias vector for a growing sequence of low-variance tasks. Unlike state-of-the-art approaches, our method does not require tuning any hyper-parameter. Our approach is presented in the non-statistical setting and can be of two variants. The "aggressive" one updates the bias after each datapoint, the "lazy" one updates the bias only at the end of each task. We derive an across-tasks regret bound for the method. When compared to state-of-the-art approaches, the aggressive variant returns faster rates, the lazy one recovers standard rates, but with no need of tuning hyper-parameters. We then adapt the methods to the statistical setting: the aggressive variant becomes a multi-task learning method, the lazy one a meta-learning method. Experiments confirm the effectiveness of our methods in practice.
 [73] arXiv:2007.05742 (cross-list from cs.LG) [pdf, other]

Title: Relation-Guided Representation Learning
Comments: Appear in Neural Networks
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Deep auto-encoders (DAEs) have achieved great success in learning data representations via the powerful representability of neural networks. But most DAEs only focus on the most dominant structures which are able to reconstruct the data from a latent space and neglect rich latent structural information. In this work, we propose a new representation learning method that explicitly models and leverages sample relations, which in turn are used as supervision to guide the representation learning. Unlike previous work, our framework preserves the relations between samples well. Since the prediction of pairwise relations themselves is a fundamental problem, our model adaptively learns them from data. This provides much flexibility to encode the real data manifold. The important role of relation and representation learning is evaluated on the clustering task. Extensive experiments on benchmark data sets demonstrate the superiority of our approach. By seeking to embed samples into a subspace, we further show that our method can address the large-scale and out-of-sample problems.
 [74] arXiv:2007.05756 (cross-list from cs.CV) [pdf, other]

Title: Generative Graph Perturbations for Scene Graph Prediction
Authors: Boris Knyazev, Harm de Vries, Cătălina Cangea, Graham W. Taylor, Aaron Courville, Eugene Belilovsky
Comments: this https URL, ICML Workshop 2020 on "Object-Oriented Learning (OOL): Perception, Representation, and Reasoning"
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Inferring objects and their relationships from an image is useful in many applications at the intersection of vision and language. Due to a long tail data distribution, the task is challenging, with the inevitable appearance of zero-shot compositions of objects and relationships at test time. Current models often fail to properly understand a scene in such cases, as during training they only observe a tiny fraction of the distribution corresponding to the most frequent compositions. This motivates us to study whether increasing the diversity of the training distribution, by generating replacements for parts of real scene graphs, can lead to better generalization. We employ generative adversarial networks (GANs) conditioned on scene graphs to generate augmented visual features. To increase their diversity, we propose several strategies to perturb the conditioning. One of them is to use a language model, such as BERT, to synthesize plausible yet still unlikely scene graphs. By evaluating our model on Visual Genome, we obtain both positive and negative results. This prompts us to make several observations that can potentially lead to further improvements.
 [75] arXiv:2007.05758 (cross-list from cs.LG) [pdf]

Title: Feature Interactions in XGBoost
Comments: 7 pages, 2 Figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper, we investigate how feature interactions can be identified to be used as constraints in the gradient boosting tree models using XGBoost's implementation. Our results show that accurate identification of these constraints can help improve the performance of the baseline XGBoost model significantly. Further, the improvement in the model structure can also lead to better interpretability.
 [76] arXiv:2007.05783 (cross-list from cs.LG) [pdf]

Title: Simulating multi-exit evacuation using deep reinforcement learning
Comments: 25 pages, 5 figures, submitted to Transactions in GIS
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Conventional simulations on multi-exit indoor evacuation focus primarily on how to determine a reasonable exit based on numerous factors in a changing environment. Results commonly include some congested and other under-utilized exits, especially with large numbers of pedestrians. We propose a multi-exit evacuation simulation based on Deep Reinforcement Learning (DRL), referred to as the MultiExit-DRL, which involves a Deep Neural Network (DNN) framework to facilitate state-to-action mapping. The DNN framework applies Rainbow Deep Q-Network (DQN), a DRL algorithm that integrates several advanced DQN methods, to improve data utilization and algorithm stability, and further divides the action space into eight isometric directions for possible pedestrian choices. We compare MultiExit-DRL with two conventional multi-exit evacuation simulation models in three separate scenarios: 1) varying pedestrian distribution ratios, 2) varying exit width ratios, and 3) varying open schedules for an exit. The results show that MultiExit-DRL presents great learning efficiency while reducing the total number of evacuation frames in all designed experiments. In addition, the integration of DRL allows pedestrians to explore other potential exits and helps determine optimal directions, leading to the high efficiency of exit utilization.
 [77] arXiv:2007.05817 (cross-list from cs.CR) [pdf, other]

Title: ManiGen: A Manifold Aided Black-box Generator of Adversarial Examples
Authors: Guanxiong Liu, Issa Khalil, Abdallah Khreishah, Abdulelah Algosaibi, Adel Aldalbahi, Mohammed Alaneem, Abdulaziz Alhumam, Mohammed Anan
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Machine learning models, especially neural network (NN) classifiers, have acceptable performance and accuracy, which has led to their wide adoption in different aspects of our daily lives. The underlying assumption is that these models are generated and used in attack-free scenarios. However, it has been shown that neural network based classifiers are vulnerable to adversarial examples. Adversarial examples are inputs with special perturbations that are ignored by human eyes yet can mislead NN classifiers. Most of the existing methods for generating such perturbations require a certain level of knowledge about the target classifier, which makes them not very practical. For example, some generators require knowledge of pre-softmax logits while others utilize prediction scores.
In this paper, we design a practical black-box adversarial example generator, dubbed ManiGen. ManiGen does not require any knowledge of the inner state of the target classifier. It generates adversarial examples by searching along the manifold, which is a concise representation of input data. Through an extensive set of experiments on different datasets, we show that (1) adversarial examples generated by ManiGen can mislead standalone classifiers by being as successful as the state-of-the-art white-box generator, Carlini, and (2) adversarial examples generated by ManiGen can more effectively attack classifiers with state-of-the-art defenses.
 [78] arXiv:2007.05824 (cross-list from cs.LG) [pdf, ps, other]

Title: Generalization bound of globally optimal non-convex neural network training: Transportation map estimation by infinite dimensional Langevin dynamics
Authors: Taiji Suzuki
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce a new theoretical framework to analyze deep learning optimization with connection to its generalization error. Existing frameworks such as mean field theory and neural tangent kernel theory for neural network optimization analysis typically require taking the limit of infinite width of the network to show its global convergence. This potentially makes it difficult to directly deal with finite-width networks; especially in the neural tangent kernel regime, we cannot reveal favorable properties of neural networks beyond kernel methods. To realize a more natural analysis, we consider a completely different approach in which we formulate the parameter training as a transportation map estimation and show its global convergence via the theory of the infinite dimensional Langevin dynamics. This enables us to analyze narrow and wide networks in a unifying manner. Moreover, we give generalization gap and excess risk bounds for the solution obtained by the dynamics. The excess risk bound achieves the so-called fast learning rate. In particular, we show an exponential convergence for a classification problem and a minimax optimal rate for a regression problem.
 [79] arXiv:2007.05825 (cross-list from physics.soc-ph) [pdf, other]

Title: Safer working spaces at coronavirus time: A novel use of antibody tests
Comments: 24 pages, 11 figures, preliminary preprint
Subjects: Physics and Society (physics.soc-ph); Populations and Evolution (q-bio.PE); Applications (stat.AP)
As SARS-CoV-2 spreads worldwide, governments struggle to keep people safe without collapsing the economy. Social distancing and quarantines have proven to be effective measures to save lives, yet their impact on the economy is becoming apparent. The major challenge faced by many countries at this point of the pandemic is to find a way to keep their critical industries such as health, telecommunications, national security, transportation, food and energy functioning while having a safe environment for their workers. In this paper we propose a novel approach based on periodic SARS-CoV-2 antibody testing to reduce the risk of contagion within the working space, and evaluate it using stochastic simulations of the health evolution of the workforce. Our simulations indicate that the proper use of testing and quarantine of workers suspected of being infected can greatly reduce the number of infections while improving the productivity of the company in the long run.
 [80] arXiv:2007.05830 (cross-list from cs.LG) [pdf, other]

Title: AutoEmbedder: A semi-supervised DNN embedding system for clustering
Authors: Abu Quwsar Ohi, M. F. Mridha, Farisa Benta Safir, Md. Abdul Hamid, Muhammad Mostafa Monowar
Comments: The manuscript is accepted and published in Knowledge-Based System
Journal-ref: Knowledge-Based Systems, p.106190 (2020)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Clustering is a widely used unsupervised learning method that deals with unlabeled data. Deep clustering has become a popular study area that relates clustering with Deep Neural Network (DNN) architecture. Deep clustering methods downsample high-dimensional data, which may also relate to a clustering loss. Deep clustering is also introduced in semi-supervised learning (SSL). Most SSL methods depend on pairwise constraint information, which is a matrix containing knowledge of whether data pairs can be in the same cluster or not. This paper introduces a novel embedding system named AutoEmbedder, that downsamples higher dimensional data to clusterable embedding points. To the best of our knowledge, this is the first research endeavor that relates a traditional classifier DNN architecture with a pairwise loss reduction technique. The training process is semi-supervised and uses Siamese network architecture to compute pairwise constraint loss in the feature learning phase. The AutoEmbedder outperforms most of the existing DNN-based semi-supervised methods tested on famous datasets.
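The pairwise constraint information mentioned above can be pictured as a small matrix built from partially labeled data. The sketch below is one possible encoding, assumed for illustration only: +1 for pairs known to share a cluster, -1 for pairs known not to, and 0 where nothing is known. The function name and the encoding are hypothetical and are not taken from the paper.

```python
def pairwise_constraints(labels):
    """Build a pairwise constraint matrix from partial labels.

    labels[i] is a class label, or None when sample i is unlabeled.
    Encoding (an assumption of this sketch):
      +1 = must-link, -1 = cannot-link, 0 = unknown.
    """
    n = len(labels)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 1          # a sample always matches itself
            elif labels[i] is None or labels[j] is None:
                matrix[i][j] = 0          # no information about this pair
            elif labels[i] == labels[j]:
                matrix[i][j] = 1
            else:
                matrix[i][j] = -1
    return matrix


# Four samples: two known "a", one known "b", one unlabeled.
constraints = pairwise_constraints(["a", "a", "b", None])
```

A Siamese-style training loop would then draw pairs (i, j) and use the corresponding entry as the target for a pairwise loss.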
 [81] arXiv:2007.05838 (cross-list from cs.LG) [pdf, other]

Title: Control as Hybrid Inference
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The field of reinforcement learning can be split into model-based and model-free methods. Here, we unify these approaches by casting model-free policy optimisation as amortised variational inference, and model-based planning as iterative variational inference, within a 'control as hybrid inference' (CHI) framework. We present an implementation of CHI which naturally mediates the balance between iterative and amortised inference. Using a didactic experiment, we demonstrate that the proposed algorithm operates in a model-based manner at the onset of learning, before converging to a model-free algorithm once sufficient data have been collected. We verify the scalability of our algorithm on a continuous control benchmark, demonstrating that it outperforms strong model-free and model-based baselines. CHI thus provides a principled framework for harnessing the sample efficiency of model-based planning while retaining the asymptotic performance of model-free policy optimisation.
 [82] arXiv:2007.05840 (cross-list from cs.LG) [pdf, other]

Title: Representation Learning via Adversarially-Contrastive Optimal Transport
Comments: Accepted at ICML 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
In this paper, we study the problem of learning compact (low-dimensional) representations for sequential data that capture its implicit spatio-temporal cues. To maximize extraction of such informative cues from the data, we set the problem within the context of contrastive representation learning and to that end propose a novel objective via optimal transport. Specifically, our formulation seeks a low-dimensional subspace representation of the data that jointly (i) maximizes the distance of the data (embedded in this subspace) from an adversarial data distribution under the optimal transport, a.k.a. the Wasserstein distance, (ii) captures the temporal order, and (iii) minimizes the data distortion. To generate the adversarial distribution, we propose a novel framework connecting Wasserstein GANs with a classifier, allowing a principled mechanism for producing good negative distributions for contrastive learning, which is currently a challenging problem. Our full objective is cast as a subspace learning problem on the Grassmann manifold and solved via Riemannian optimization. To empirically study our formulation, we provide experiments on the task of human action recognition in video sequences. Our results demonstrate competitive performance against challenging baselines.
 [83] arXiv:2007.05852 (cross-list from cs.LG) [pdf, other]

Title: Submodular Meta-Learning
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
In this paper, we introduce a discrete variant of the meta-learning framework. Meta-learning aims at exploiting prior experience and data to improve performance on future tasks. By now, there exist numerous formulations for meta-learning in the continuous domain. Notably, the Model-Agnostic Meta-Learning (MAML) formulation views each task as a continuous optimization problem and based on prior data learns a suitable initialization that can be adapted to new, unseen tasks after a few simple gradient updates. Motivated by this terminology, we propose a novel meta-learning framework in the discrete domain where each task is equivalent to maximizing a set function under a cardinality constraint. Our approach aims at using prior data, i.e., previously visited tasks, to train a proper initial solution set that can be quickly adapted to a new task at a relatively low computational cost. This approach leads to (i) a personalized solution for each individual task, and (ii) significantly reduced computational cost at test time compared to the case where the solution is fully optimized once the new task is revealed. The training procedure is performed by solving a challenging discrete optimization problem for which we present deterministic and randomized algorithms. In the case where the tasks are monotone and submodular, we show strong theoretical guarantees for our proposed methods even though the training objective may not be submodular. We also demonstrate the effectiveness of our framework on two real-world problem instances where we observe that our methods lead to a significant reduction in computational complexity in solving the new tasks while incurring a small performance loss compared to when the tasks are fully optimized.
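To make the single-task problem in this abstract concrete (maximizing a set function under a cardinality constraint), here is a minimal sketch of the classical greedy heuristic on a toy coverage objective. This is the standard baseline for monotone submodular maximization, not the meta-learning procedure the paper proposes; the names `greedy_max`, `coverage` and `COVERS` are hypothetical.

```python
def greedy_max(objective, ground_set, k):
    """Pick up to k items, each time adding the item with the largest
    marginal gain. Classical greedy baseline, shown for illustration only."""
    chosen = set()
    for _ in range(k):
        base = objective(chosen)
        best_item, best_gain = None, 0.0
        # Sorted iteration keeps tie-breaking deterministic.
        for item in sorted(ground_set - chosen):
            gain = objective(chosen | {item}) - base
            if gain > best_gain:
                best_item, best_gain = item, gain
        if best_item is None:  # no item improves the objective
            break
        chosen.add(best_item)
    return chosen

# Toy task: each item covers some elements; the objective counts covered elements.
COVERS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}

def coverage(selection):
    covered = set()
    for name in selection:
        covered |= COVERS[name]
    return len(covered)

solution = greedy_max(coverage, set(COVERS), k=2)
print(sorted(solution), coverage(solution))
```

Running the full greedy loop for every new task is the "fully optimized" setting the abstract compares against; the paper's idea is to precompute part of the solution from earlier tasks.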
 [84] arXiv:2007.05860 (cross-list from math.OC) [pdf, other]

Title: Solving Bayesian Risk Optimization via Nested Stochastic Gradient Estimation
Comments: The paper is 21 pages with 3 figures. The supplement is an additional 16 pages. The paper is currently under review at IISE Transactions
Subjects: Optimization and Control (math.OC); Computation (stat.CO)
In this paper, we aim to solve Bayesian Risk Optimization (BRO), which is a recently proposed framework that formulates simulation optimization under input uncertainty. In order to efficiently solve the BRO problem, we derive nested stochastic gradient estimators and propose corresponding stochastic approximation algorithms. We show that our gradient estimators are asymptotically unbiased and consistent, and that the algorithms converge asymptotically. We demonstrate the empirical performance of the algorithms on a two-sided market model. Our estimators are of independent interest in extending the literature of stochastic gradient estimation to the case of nested risk functions.
 [85] arXiv:2007.05869 (cross-list from cs.LG) [pdf, other]

Title: Adversarially-Trained Deep Nets Transfer Better
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Transfer learning has emerged as a powerful methodology for adapting pre-trained deep neural networks to new domains. This process consists of taking a neural network pre-trained on a large feature-rich source dataset, freezing the early layers that encode essential generic image properties, and then fine-tuning the last few layers in order to capture specific information related to the target situation. This approach is particularly useful when only limited or weakly labelled data are available for the new task. In this work, we demonstrate that adversarially-trained models transfer better across new domains than naturally-trained models, even though it's known that these models do not generalize as well as naturally-trained models on the source domain. We show that this behavior results from a bias, introduced by the adversarial training, that pushes the learned inner layers to more natural image representations, which in turn enables better transfer.
 [86] arXiv:2007.05879 (cross-list from cs.LG) [pdf, other]

Title: On Improving Hotspot Detection Through Synthetic Pattern-Based Database Enhancement
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Continuous technology scaling and the introduction of advanced technology nodes in Integrated Circuit (IC) fabrication are constantly exposing new manufacturability issues. One such issue, stemming from complex interaction between design and process, is the problem of design hotspots. Such hotspots are known to vary from design to design and, ideally, should be predicted early and corrected in the design stage itself, as opposed to relying on the foundry to develop process fixes for every hotspot, which would be intractable. In the past, various efforts have been made to address this issue by using a known database of hotspots as the source of information. The majority of these efforts use either Machine Learning (ML) or Pattern Matching (PM) techniques to identify and predict hotspots in new incoming designs. However, almost all of them suffer from high false-alarm rates, mainly because they are oblivious to the root causes of hotspots. In this work, we seek to address this limitation by using a novel database enhancement approach through synthetic pattern generation based on carefully crafted Design of Experiments (DOEs). Effectiveness of the proposed method against the state-of-the-art is evaluated on a 45nm process using industry-standard tools and designs.
 [87] arXiv:2007.05881 (cross-list from cs.LG) [pdf, other]

Title: Applying recent advances in Visual Question Answering to Record Linkage
Authors: Marko Smilevski
Comments: 48 pages, 15 figures, 6 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (stat.ML)
Multimodal Record Linkage is the process of matching multimodal records from multiple sources that represent the same entity. This field has not been explored in research and we propose two solutions based on Deep Learning architectures that are inspired by recent work in Visual Question Answering. The neural networks we propose use two different fusion modules, the Recurrent Neural Network + Convolutional Neural Network fusion module and the Stacked Attention Network fusion module, that jointly combine the visual and the textual data of the records. The output of these fusion models is the input of a Siamese Neural Network that computes the similarity of the records. Using data from the Avito Duplicate Advertisements Detection dataset, we train these solutions, and from the experiments we conclude that the Recurrent Neural Network + Convolutional Neural Network fusion module outperforms a simple model that uses hand-crafted features. We also find that the Recurrent Neural Network + Convolutional Neural Network fusion module classifies dissimilar advertisements as similar more frequently if their average description is longer than 40 words. We conclude that the reason for this is that the longer advertisements have a different distribution than the shorter advertisements, which are more prevalent in the dataset. In the end, we also conclude that further research needs to be done with the Stacked Attention Network, to further explore the effects of the visual data on the performance of the fusion modules.
 [88] arXiv:2007.05896 (cross-list from cs.LG) [pdf, other]

Title: Learning Abstract Models for Strategic Exploration and Fast Reward Transfer
Authors: Evan Zheran Liu, Ramtin Keramati, Sudarshan Seshadri, Kelvin Guu, Panupong Pasupat, Emma Brunskill, Percy Liang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Model-based reinforcement learning (RL) is appealing because (i) it enables planning and thus more strategic exploration, and (ii) by decoupling dynamics from rewards, it enables fast transfer to new reward functions. However, learning an accurate Markov Decision Process (MDP) over high-dimensional states (e.g., raw pixels) is extremely challenging because it requires function approximation, which leads to compounding errors. Instead, to avoid compounding errors, we propose learning an abstract MDP over abstract states: low-dimensional coarse representations of the state (e.g., capturing agent position, ignoring other objects). We assume access to an abstraction function that maps the concrete states to abstract states. In our approach, we construct an abstract MDP, which grows through strategic exploration via planning. Similar to hierarchical RL approaches, the abstract actions of the abstract MDP are backed by learned subpolicies that navigate between abstract states. Our approach achieves strong results on three of the hardest Arcade Learning Environment games (Montezuma's Revenge, Pitfall!, and Private Eye), including superhuman performance on Pitfall! without demonstrations. After training on one task, we can reuse the learned abstract MDP for new reward functions, achieving higher reward in 1000x fewer samples than model-free methods trained from scratch.
 [89] arXiv:2007.05912 (cross-list from cs.DS) [pdf, ps, other]

Title: Robust Learning of Mixtures of Gaussians
Authors: Daniel M. Kane
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)
We resolve one of the major outstanding problems in robust statistics. In particular, if $X$ is an evenly weighted mixture of two arbitrary $d$-dimensional Gaussians, we devise a polynomial time algorithm that, given access to samples from $X$, an $\eps$-fraction of which have been adversarially corrupted, learns $X$ to error $\poly(\eps)$ in total variation distance.
 [90] arXiv:2007.05929 (cross-list from cs.LG) [pdf, other]

Title: Data-Efficient Reinforcement Learning with Momentum Predictive Representations
Comments: The first two authors contributed equally to this work
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
While deep reinforcement learning excels at solving tasks where large amounts of data can be collected through virtually unlimited interaction with the environment, learning from limited interaction remains a key challenge. We posit that an agent can learn more efficiently if we augment reward maximization with self-supervised objectives based on structure in its visual input and sequential interaction with the environment. Our method, Momentum Predictive Representations (MPR), trains an agent to predict its own latent state representations multiple steps into the future. We compute target representations for future states using an encoder which is an exponential moving average of the agent's parameters, and we make predictions using a learned transition model. On its own, this future prediction objective outperforms prior methods for sample-efficient deep RL from pixels. We further improve performance by adding data augmentation to the future prediction loss, which forces the agent's representations to be consistent across multiple views of an observation. Our full self-supervised objective, which combines future prediction and data augmentation, achieves a median human-normalized score of 0.444 on Atari in a setting limited to 100K steps of environment interaction, which is a 66% relative improvement over the previous state-of-the-art. Moreover, even in this limited data regime, MPR exceeds expert human scores on 6 out of 26 games.
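The target encoder described in this abstract is an exponential moving average (EMA) of the agent's parameters. A minimal sketch of such an update follows, assuming parameters are stored as a dictionary of NumPy arrays; the function name `ema_update` and the decay value `tau` are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.99):
    """Move each target parameter a small step toward the online parameter.
    tau close to 1 means the target changes slowly (tau is an assumed value)."""
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }

online = {"w": np.array([1.0, 2.0])}
target = {"w": np.array([0.0, 0.0])}
target = ema_update(target, online, tau=0.9)
print(target["w"])
```

Because the target is never updated by gradients, it provides slowly changing prediction targets for the online encoder.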
 [91] arXiv:2007.05943 (cross-list from cs.LG) [pdf, other]

Title: On the generalization of Tanimoto-type kernels to real valued functions
Comments: Pages 12, 3 PDF figures, uses arxiv.sty
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The Tanimoto kernel (Jaccard index) is a well-known tool to describe the similarity between sets of binary attributes. It has been extended to the case when the attributes are nonnegative real values. This paper introduces a more general Tanimoto kernel formulation which allows measuring the similarity of arbitrary real-valued functions. This extension is constructed by unifying the representation of the attributes via properly chosen sets. After deriving the general form of the kernel, an explicit feature representation is extracted from the kernel function, and a simple way of including general kernels into the Tanimoto kernel is shown. Finally, the kernel is also expressed as a quotient of piecewise linear functions, and a smooth approximation is provided.
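For reference, the two classical forms that this abstract starts from can be sketched as follows: the binary Tanimoto (Jaccard) similarity written with inner products, and the common min/max form for nonnegative real attributes. The function names are hypothetical, and the paper's general real-valued construction is not reproduced here.

```python
import numpy as np

def tanimoto_binary(x, y):
    """Tanimoto similarity <x,y> / (<x,x> + <y,y> - <x,y>); for 0/1 vectors
    this equals |intersection| / |union| (the Jaccard index)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    inner = x @ y
    denom = x @ x + y @ y - inner
    return 1.0 if denom == 0 else inner / denom

def tanimoto_minmax(x, y):
    """Min/max form for nonnegative real attributes:
    sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    total_max = np.maximum(x, y).sum()
    return 1.0 if total_max == 0 else np.minimum(x, y).sum() / total_max

a = [1, 1, 0, 1]
b = [1, 0, 1, 1]
print(tanimoto_binary(a, b), tanimoto_minmax(a, b))
```

On 0/1 vectors the two forms agree; they differ once the attributes take other nonnegative values.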
 [92] arXiv:2007.05970 (cross-list from cs.LG) [pdf, other]

Title: Inverse Graph Identification: Can We Identify Node Labels Given Graph Labels?
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Graph Identification (GI) has long been researched in graph learning and is essential in certain applications (e.g. social community detection). Specifically, GI requires predicting the label/score of a target graph given its collection of node features and edge connections. While this task is common, more complex cases arise in practice: we are supposed to do the inverse by, for example, grouping similar users in a social network given the labels of different communities. This triggers an interesting thought: can we identify nodes given the labels of the graphs they belong to? Therefore, this paper defines a novel problem dubbed Inverse Graph Identification (IGI), as opposed to GI. Upon a formal discussion of the variants of IGI, we choose a particular case study of node clustering by making use of the graph labels and node features, with the assistance of a hierarchical graph that further characterizes the connections between different graphs. To address this task, we propose Gaussian Mixture Graph Convolutional Network (GMGCN), a simple yet effective method that performs the node-level message passing process using Graph Attention Network (GAT) under the protocol of GI and then infers the category of each node via a Gaussian Mixture Layer (GML). The training of GMGCN is further boosted by a proposed consensus loss to take advantage of the structure of the hierarchical graph. Extensive experiments are conducted to test the rationality of the formulation of IGI. We verify the superiority of the proposed method compared to other baselines on several benchmarks we have built up. We will release our code along with the benchmark data to facilitate more research attention to the IGI problem.
 [93] arXiv:2007.05975 (cross-list from cs.IT) [pdf, other]

Title: A Graph Symmetrisation Bound on Channel Information Leakage under Blowfish Privacy
Comments: 11 pages, 3 figures
Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Blowfish privacy is a recent generalisation of differential privacy that enables improved utility while maintaining privacy policies with semantic guarantees, a factor that has driven the popularity of differential privacy in computer science. This paper relates Blowfish privacy to an important measure of privacy loss of information channels from the communications theory community: min-entropy leakage. Symmetry in an input data neighbouring relation is central to known connections between differential privacy and min-entropy leakage. But while differential privacy exhibits strong symmetry, Blowfish neighbouring relations correspond to arbitrary simple graphs owing to the framework's flexible privacy policies. To bound the min-entropy leakage of Blowfish-private mechanisms we organise our analysis over symmetrical partitions corresponding to orbits of graph automorphism groups. A construction meeting our bound with asymptotic equality demonstrates sharpness.
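As background for the quantity bounded in this abstract, the following sketch computes the min-entropy leakage of a small discrete channel using the standard definition from quantitative information flow: the base-2 logarithm of the ratio between posterior and prior vulnerability. The function name and the example channels are assumptions for illustration; the paper's Blowfish-specific bound is not reproduced.

```python
import numpy as np

def min_entropy_leakage(prior, channel):
    """Leakage in bits: log2( sum_y max_x prior[x] * channel[x, y] / max_x prior[x] ).
    `channel[x, y]` is the probability of output y given secret input x."""
    prior = np.asarray(prior, dtype=float)
    channel = np.asarray(channel, dtype=float)
    joint = prior[:, None] * channel           # joint[x, y] = P(x, y)
    posterior_vuln = joint.max(axis=0).sum()   # best guess after seeing y
    prior_vuln = prior.max()                   # best guess before seeing y
    return np.log2(posterior_vuln / prior_vuln)

uniform = np.full(4, 0.25)
identity_channel = np.eye(4)            # output reveals the input exactly
constant_channel = np.full((4, 2), 0.5)  # output is independent of the input
print(min_entropy_leakage(uniform, identity_channel))
print(min_entropy_leakage(uniform, constant_channel))
```

A channel that reveals the secret leaks all of its bits under a uniform prior, while a channel whose output is independent of the input leaks nothing.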
 [94] arXiv:2007.05986 (cross-list from math.PR) [pdf, ps, other]

Title: Technical Note - Exact simulation of the first passage time of Brownian motion to a symmetric linear boundary
Comments: 6 pages
Subjects: Probability (math.PR); Statistics Theory (math.ST)
We state an exact simulation scheme for the first passage time of a Brownian motion to a symmetric linear boundary.
 [95] arXiv:2007.06007 (cross-list from cs.LG) [pdf, ps, other]

Title: Universal Approximation Power of Deep Neural Networks via Nonlinear Control Theory
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
In this paper, we explain the universal approximation capabilities of deep neural networks through geometric nonlinear control. Inspired by recent work establishing links between residual networks and control systems, we provide a general sufficient condition for a residual network to have the power of universal approximation by asking the activation function, or one of its derivatives, to satisfy a quadratic differential equation. Many activation functions used in practice satisfy this assumption, exactly or approximately, and we show this property to be sufficient for an adequately deep neural network with n states to approximate arbitrarily well any continuous function defined on a compact subset of R^n. We further show this result to hold for very simple architectures, where the weights only need to assume two values. The key technical contribution consists of relating the universal approximation problem to controllability of an ensemble of control systems corresponding to a residual network, and leveraging classical Lie algebraic techniques to characterize controllability.
 [96] arXiv:2007.06024 (cross-list from cs.LG) [pdf, other]

Title: The Impossibility Theorem of Machine Fairness - A Causal Perspective
Authors: Kailash Karthik S
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
With the increasingly pervasive use of machine learning in social and economic settings, there has been an interest in the notion of machine bias in the AI community. Models trained on historic data reflect the biases that exist in society, and these biases are propagated to the future through the models' decisions. A recent study conducted by ProPublica revealed that the COMPAS recidivism prediction tool was biased against the African-American community. There are three prominent metrics of fairness used in the community, and it has been statistically proved that it is impossible to satisfy them at the same time, which has led to ambiguity about the definition of fairness. In this report, a causal perspective on the impossibility theorem of fairness is presented along with a causal goal for machine fairness.
 [97] arXiv:2007.06029 (cross-list from cs.LG) [pdf, other]

Title: Ensuring Fairness Beyond the Training Data
Comments: 18 pages, 3 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We initiate the study of fair classifiers that are robust to perturbations in the training distribution. Despite recent progress, the literature on fairness has largely ignored the design of fair and robust classifiers. In this work, we develop classifiers that are fair not only with respect to the training distribution, but also for a class of distributions that are weighted perturbations of the training samples. We formulate a min-max objective function whose goal is to minimize a distributionally robust training loss, and at the same time, find a classifier that is fair with respect to a class of distributions. We first reduce this problem to finding a fair classifier that is robust with respect to the class of distributions. Based on an online learning algorithm, we develop an iterative algorithm that provably converges to such a fair and robust solution. Experiments on standard machine learning fairness datasets suggest that, compared to the state-of-the-art fair classifiers, our classifier retains fairness guarantees and test accuracy for a large class of perturbations on the test set. Furthermore, our experiments show that there is an inherent trade-off between fairness robustness and accuracy of such classifiers.
 [98] arXiv:2007.06049 (cross-list from cs.LG) [pdf, other]

Title: An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Prioritized Experience Replay (PER) is a deep reinforcement learning technique in which agents learn from transitions sampled with non-uniform probability proportionate to their temporal-difference error. We show that any loss function evaluated with non-uniformly sampled data can be transformed into another uniformly sampled loss function with the same expected gradient. Surprisingly, we find in some environments PER can be replaced entirely by this new loss function without impact to empirical performance. Furthermore, this relationship suggests a new branch of improvements to PER by correcting its uniformly sampled loss function equivalent. We demonstrate the effectiveness of our proposed modifications to PER and the equivalent loss function in several MuJoCo and Atari environments.
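The equal-expected-gradient claim in this abstract can be checked numerically on a toy replay buffer. The sketch below uses a squared loss and priorities proportional to the absolute temporal-difference error; the small offset added to the priorities, the buffer size, and the omission of PER's exponent and importance-sampling weights are simplifying assumptions, and the specific loss functions derived in the paper are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
td_errors = rng.normal(size=8)            # toy temporal-difference errors
priorities = np.abs(td_errors) + 1e-3     # assumed priority rule
probs = priorities / priorities.sum()     # non-uniform sampling distribution

# Gradient of the loss 0.5 * delta^2 with respect to delta is delta itself.
grads = td_errors

# Expected gradient when transitions are drawn with probability probs[i].
expected_grad_nonuniform = np.sum(probs * grads)

# Same expectation under uniform sampling, with each gradient rescaled by
# n * probs[i]; the weight is treated as a constant, not differentiated.
n = len(td_errors)
expected_grad_uniform = np.mean(n * probs * grads)

print(expected_grad_nonuniform, expected_grad_uniform)
```

The two expectations match because the factor `n * probs[i]` exactly cancels the change of sampling distribution.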
 [99] arXiv:2007.06059 (cross-list from cs.LG) [pdf, other]

Title: It Is Likely That Your Loss Should be a Likelihood
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
We recall that certain common losses are simplified likelihoods and instead argue for optimizing full likelihoods that include their parameters, such as the variance of the normal distribution and the temperature of the softmax distribution. Joint optimization of likelihood and model parameters can adaptively tune the scales and shapes of losses and the weights of regularizers. We survey and systematically evaluate how to parameterize and apply likelihood parameters for robust modeling and recalibration. Additionally, we propose adaptively tuning $L_2$ and $L_1$ weights by fitting the scale parameters of normal and Laplace priors and introduce more flexible elementwise regularizers.
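As a small illustration of the point that common losses are simplified likelihoods, the sketch below writes the Gaussian negative log-likelihood with an explicit scale parameter. With the scale fixed at 1 it reduces to half the mean squared error plus a constant, while fitting the scale lowers the loss further. The function name and the toy residuals are assumptions for this example.

```python
import numpy as np

def gaussian_nll(residuals, sigma):
    """Mean negative log-likelihood of residuals under N(0, sigma^2)."""
    residuals = np.asarray(residuals, dtype=float)
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + residuals**2 / (2 * sigma**2))

residuals = np.array([0.5, -1.5, 2.0, -0.5])

# With sigma fixed to 1, the NLL is 0.5 * MSE plus a constant.
fixed = gaussian_nll(residuals, 1.0)

# The maximum-likelihood scale is the root mean squared residual.
sigma_hat = np.sqrt(np.mean(residuals**2))
fitted = gaussian_nll(residuals, sigma_hat)

print(fixed, fitted, sigma_hat)
```

In a neural network the scale would typically be a learned parameter or an extra output, optimized jointly with the model weights.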
 [100] arXiv:2007.06062 (cross-list from cs.LG) [pdf, other]

Title: Transfer Learning for Activity Recognition in Mobile Health
Authors: Yuchao Ma, Andrew T. Campbell, Diane J. Cook, John Lach, Shwetak N. Patel, Thomas Ploetz, Majid Sarrafzadeh, Donna Spruijt-Metz, Hassan Ghasemzadeh
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
While activity recognition from inertial sensors holds potential for mobile health, differences in sensing platforms and user movement patterns cause performance degradation. Aiming to address these challenges, we propose a transfer learning framework, TransFall, for sensor-based activity recognition. TransFall's design contains a two-tier data transformation, a label estimation layer, and a model generation layer to recognize activities for the new scenario. We validate TransFall analytically and empirically.
 [101] arXiv:2007.06063 (cross-list from cs.LG) [pdf, other]

Title: Exploiting Uncertainties from Ensemble Learners to Improve Decision-Making in Healthcare AI
Comments: Preprint of submission to NeurIPS 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Ensemble learning is widely applied in Machine Learning (ML) to improve model performance and to mitigate decision risks. In this approach, predictions from a diverse set of learners are combined to obtain a joint decision. Recently, various methods have been explored in the literature for estimating decision uncertainties using ensemble learning; however, determining which metrics are a better fit for certain decision-making applications remains a challenging task. In this paper, we study the following key research question in the selection of uncertainty metrics: when does an uncertainty metric outperform another? We answer this question via a rigorous analysis of two commonly used uncertainty metrics in ensemble learning, namely ensemble mean and ensemble variance. We show that, under mild assumptions on the ensemble learners, ensemble mean is preferable with respect to ensemble variance as an uncertainty metric for decision making. We empirically validate our assumptions and theoretical results via an extensive case study: the diagnosis of referable diabetic retinopathy.
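The two statistics compared in this abstract can be computed directly from the per-learner predictions. The sketch below assumes each ensemble member outputs a probability for the positive class; the function name and the toy numbers are hypothetical, and how the paper turns each statistic into a referral decision is not specified here.

```python
import numpy as np

def ensemble_mean_and_variance(member_probs):
    """member_probs has shape (n_members, n_samples); returns the
    per-sample mean and (population) variance across ensemble members."""
    member_probs = np.asarray(member_probs, dtype=float)
    return member_probs.mean(axis=0), member_probs.var(axis=0)

# Three learners scoring two samples (toy numbers).
probs = [[0.9, 0.4],
         [0.8, 0.6],
         [1.0, 0.5]]
mean, variance = ensemble_mean_and_variance(probs)
print(mean, variance)
```

In this toy case both samples have the same variance across learners, yet the second has a mean near 0.5, which shows that the two statistics can rank samples differently.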
 [102] arXiv:2007.06068 (cross-list from cs.CV) [pdf, other]

Title: Visualizing Classification Structure in Deep Neural Networks
Comments: 2020 ICML Workshop on Human Interpretability in Machine Learning (WHI 2020)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose a measure to compute class similarity in large-scale classification based on prediction scores. Such a measure has not been formally proposed in the literature. We show how visualizing the class similarity matrix can reveal hierarchical structures and relationships that govern the classes. Through examples with various classifiers, we demonstrate how such structures can help in analyzing the classification behavior and in inferring potential corner cases. The source code for one example is available as a notebook at https://github.com/bilalsal/blocks
 [103] arXiv:2007.06081 (cross-list from cs.LG) [pdf, other]

Title: VAFL: a Method of Vertical Asynchronous Federated Learning
Comments: FL-ICML'20: Proc. of ICML Workshop on Federated Learning for User Privacy and Data Confidentiality, July 2020
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
Horizontal Federated learning (FL) handles multi-client data that share the same set of features, and vertical FL trains a better predictor that combines all the features from different clients. This paper targets solving vertical FL in an asynchronous fashion, and develops a simple FL method. The new method allows each client to run stochastic gradient algorithms without coordination with other clients, so it is suitable for intermittent connectivity of clients. This method further uses a new technique of perturbed local embedding to ensure data privacy and improve communication efficiency. Theoretically, we present the convergence rate and privacy level of our method for strongly convex, nonconvex and even nonsmooth objectives separately. Empirically, we apply our method to FL on various image and healthcare datasets. The results compare favorably to centralized and synchronous FL methods.
 [104] arXiv:2007.06082 (cross-list from quant-ph) [pdf, other]

Title: Entanglement and Tensor Networks for Supervised Image Classification
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Tensor networks, originally designed to address computational problems in quantum many-body physics, have recently been applied to machine learning tasks. However, compared to quantum physics, where the reasons for the success of tensor network approaches over the last 30 years are well understood, very little is yet known about why these techniques work for machine learning. The goal of this paper is to investigate entanglement properties of tensor network models in a current machine learning application, in order to uncover general principles that may guide future developments. We revisit the use of tensor networks for supervised image classification using the MNIST data set of handwritten digits, as pioneered by Stoudenmire and Schwab [Adv. in Neur. Inform. Proc. Sys. 29, 4799 (2016)]. Firstly we hypothesize about which state the tensor network might be learning during training. For that purpose, we propose a plausible candidate state $|\Sigma_{\ell}\rangle$ (built as a superposition of product states corresponding to images in the training set) and investigate its entanglement properties. We conclude that $|\Sigma_{\ell}\rangle$ is so robustly entangled that it cannot be approximated by the tensor network used in that work, which must therefore be representing a very different state. Secondly, we use tensor networks with a block product structure, in which entanglement is restricted within small blocks of $n \times n$ pixels/qubits. We find that these states are extremely expressive (e.g. training accuracy of $99.97 \%$ already for $n=2$), suggesting that long-range entanglement may not be essential for image classification. However, in our current implementation, optimization leads to overfitting, resulting in test accuracies that are not competitive with other current approaches.
 [105] arXiv:2007.06083 (cross-list from math.PR) [pdf, ps, other]

Title: On almost sure limit theorems for long-range dependent, heavy-tailed processes
Authors: Michael A. Kouritzin (1), Sounak Paul (2) ((1) University of Alberta, (2) University of Chicago)
Subjects: Probability (math.PR); Statistics Theory (math.ST)
Classical methods of inference are often rendered inapplicable while dealing with data exhibiting heavy tails, which gives rise to infinite variance and frequent extremes, and long memory, which induces inertia in the data. In this paper, we develop the Marcinkiewicz strong law of large numbers, $n^{-\frac1p}\sum_{k=1}^{n} (d_{k} - d)\rightarrow 0$ almost surely with $p\in(1,2)$, for products $d_k=\prod_{r=1}^s x_k^{(r)}$, where each $x_k^{(r)} = \sum_{l=-\infty}^{\infty}c_{k-l}^{(r)}\xi_l^{(r)}$ is a two-sided univariate linear process with coefficients $\{c_l^{(r)}\}_{l\in \mathbb{Z}}$ and i.i.d. zero-mean innovations $\{\xi_l^{(r)}\}_{l\in \mathbb{Z}}$ respectively. The decay of the coefficients $c_l^{(r)}$ as $l\to\infty$ can be slow enough that $\{x_k^{(r)}\}$ can have long memory while $\{d_k\}$ can have heavy tails. The aim of this paper is to handle the long-range dependence and heavy tails for $\{d_k\}$ simultaneously, and to prove a decoupling property that shows the convergence rate is dictated by the worst of long-range dependence and heavy tails, but not their combination. The multivariate linear process case is also considered.
 [106] arXiv:2007.06093 (cross-list from cs.LG) [pdf, other]

Title: Abstract Universal Approximation for Neural Networks
Subjects: Machine Learning (cs.LG); Programming Languages (cs.PL); Machine Learning (stat.ML)
With growing concerns about the safety and robustness of neural networks, a number of researchers have successfully applied abstract interpretation with numerical domains to verify properties of neural networks. Why do numerical domains work for neural-network verification? We present a theoretical result that demonstrates the power of numerical domains, namely, the simple interval domain, for analysis of neural networks. Our main theorem, which we call the abstract universal approximation (AUA) theorem, generalizes the recent result by Baader et al. [2020] for ReLU networks to a rich class of neural networks. The classical universal approximation theorem says that, given function $f$, for any desired precision, there is a neural network that can approximate $f$. The AUA theorem states that for any function $f$, there exists a neural network whose abstract interpretation is an arbitrarily close approximation of the collecting semantics of $f$. Further, the network may be constructed using any well-behaved activation function (sigmoid, tanh, parametric ReLU, ELU, and more), making our result quite general.
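As an illustration of the interval abstract domain mentioned above, the following is a minimal sketch of interval propagation through a small ReLU network. The function names (`affine_interval`, `relu_interval`, `network_interval`) and the toy weights are hypothetical and are not taken from the paper; the sketch only shows the general idea of abstractly interpreting a network over a box of inputs.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def affine_interval(weights: List[List[float]], bias: List[float],
                    box: List[Interval]) -> List[Interval]:
    """Propagate a box through y = W x + b using interval arithmetic."""
    out = []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, (l, u) in zip(row, box):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:  # a negative weight swaps which endpoint gives the bound
                lo += w * u
                hi += w * l
        out.append((lo, hi))
    return out


def relu_interval(box: List[Interval]) -> List[Interval]:
    """ReLU is monotone, so it can be applied to each endpoint."""
    return [(max(0.0, l), max(0.0, u)) for l, u in box]


def network_interval(layers, box: List[Interval]) -> List[Interval]:
    """Abstractly interpret a ReLU network (no ReLU after the last layer)."""
    for i, (weights, bias) in enumerate(layers):
        box = affine_interval(weights, bias, box)
        if i < len(layers) - 1:
            box = relu_interval(box)
    return box


# Toy 2-2-1 network with made-up weights, analysed over the unit square.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.25]),
    ([[1.0, 1.0]], [0.0]),
]
bounds = network_interval(layers, [(0.0, 1.0), (0.0, 1.0)])
print(bounds)  # [(0.0, 1.75)]
```

For this toy network the exact maximum over the unit square is 1.25 (attained at the corner (1, 0)), so the interval result 1.75 is sound but loose. The AUA theorem concerns the existence of networks for which this gap can be made arbitrarily small.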
The implication of the AUA theorem is that there exist provably correct neural networks: Suppose, for instance, that there is an ideal robust image classifier represented as function $f$. The AUA theorem tells us that there exists a neural network that approximates $f$ and for which we can automatically construct proofs of robustness using the interval abstract domain. Our work sheds light on the existence of provably correct neural networks, using arbitrary activation functions, and establishes intriguing connections between well-known theoretical properties of neural networks and abstract interpretation using numerical domains.
 [107] arXiv:2007.06106 (cross-list from cs.LG) [pdf, other]

Title: Unsupervised Feature Selection for Tumor Profiles using Autoencoders and Kernel Methods
Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Molecular data from tumor profiles is high dimensional. Tumor profiles can be characterized by tens of thousands of gene expression features. Due to the size of the gene expression feature set, machine learning methods are exposed to noisy variables and complexity. Tumor types present heterogeneity and can be subdivided into tumor subtypes. In many cases tumor data does not include tumor subtype labeling, so unsupervised learning methods are necessary for tumor subtype discovery. This work aims to learn meaningful, low-dimensional representations of tumor samples and to find tumor subtype clusters while keeping biological signatures, without using tumor labels. The proposed method, named Latent Kernel Feature Selection (LKFS), is an unsupervised approach for gene selection in tumor gene expression profiles. Using Autoencoders, a low-dimensional and denoised latent space is learned as a target representation to guide a Multiple Kernel Learning model that selects a subset of genes. A clustering method then groups the samples using the selected genes. To evaluate the performance of the proposed unsupervised feature selection method, the obtained features and clusters are analyzed for clinical significance. The proposed method has been applied to three tumor datasets (Brain, Renal and Lung), each composed of two tumor subtypes. Compared with benchmark unsupervised feature selection methods, the proposed method yields lower redundancy in the selected features and better clustering performance.
 [108] arXiv:2007.06123 (cross-list from cs.SD) [pdf, other]

Title: OtoWorld: Towards Learning to Separate by Learning to Move
Comments: Published in Self Supervision in Audio and Speech Workshop, 37th International Conference on Machine Learning, Vienna, Austria (ICML 2020)
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
We present OtoWorld, an interactive environment in which agents must learn to listen in order to solve navigational tasks. The purpose of OtoWorld is to facilitate reinforcement learning research in computer audition, where agents must learn to listen to the world around them to navigate. OtoWorld is built on three open-source libraries: OpenAI Gym for environment and agent interaction, PyRoomAcoustics for ray tracing and acoustics simulation, and nussl for training deep computer audition models. OtoWorld is the audio analogue of GridWorld, a simple navigation game. OtoWorld can be easily extended to more complex environments and games. To solve one episode of OtoWorld, an agent must move towards each sounding source in the auditory scene and "turn it off". The agent receives no other input than the current sound of the room. The sources are placed randomly within the room and can vary in number. The agent receives a reward for turning off a source. We present preliminary results on the ability of agents to win at OtoWorld. OtoWorld is open-source and available.
 [109] arXiv:2007.06126 (cross-list from cs.LG) [pdf, other]

Title: Disentangled Variational Autoencoder based Multi-Label Classification with Covariance-Aware Multivariate Probit Model
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Multi-label classification is the challenging task of predicting the presence and absence of multiple targets, involving representation learning and label correlation modeling. We propose a novel framework for multi-label classification, Multivariate Probit Variational AutoEncoder (MPVAE), that effectively learns latent embedding spaces as well as label correlations. MPVAE learns and aligns two probabilistic embedding spaces for labels and features respectively. The decoder of MPVAE takes in the samples from the embedding spaces and models the joint distribution of output targets under a Multivariate Probit model by learning a shared covariance matrix. We show that MPVAE outperforms the existing state-of-the-art methods on a variety of application domains, using public real-world datasets. MPVAE is further shown to remain robust under noisy settings. Lastly, we demonstrate the interpretability of the learned covariance by a case study on a bird observation dataset.
 [110] arXiv:2007.06133 (cross-list from cs.LG) [pdf, other]

Title: Explainable Recommendation via Interpretable Feature Mapping and Evaluation of Explainability
Comments: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI)
Journal-ref: IJCAI 2020, pages 2690-2696
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Latent factor collaborative filtering (CF) has been a widely used technique for recommender systems, learning semantic representations of users and items. Recently, explainable recommendation has attracted much attention from the research community. However, a trade-off exists between explainability and performance of the recommendation, and metadata is often needed to alleviate the dilemma. We present a novel feature mapping approach that maps the uninterpretable general features onto the interpretable aspect features, achieving both satisfactory accuracy and explainability in the recommendations by simultaneous minimization of rating prediction loss and interpretation loss. To evaluate the explainability, we propose two new evaluation metrics specifically designed for aspect-level explanation using surrogate ground truth. Experimental results demonstrate strong performance in both recommendation and explanation, eliminating the need for metadata. Code is available from https://github.com/pd90506/AMCF.
 [111] arXiv:2007.06134 (cross-list from cs.LG) [pdf, other]

Title: Adaptive Periodic Averaging: A Practical Approach to Reducing Communication in Distributed Learning
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Stochastic Gradient Descent (SGD) is the key learning algorithm for many machine learning tasks. Because of its computational costs, there is a growing interest in accelerating SGD on HPC resources like GPU clusters. However, the performance of parallel SGD is still bottlenecked by the high communication costs even with a fast connection among the machines. A simple approach to alleviating this problem, used in many existing efforts, is to perform communication every few iterations, using a constant averaging period. In this paper, we show that the optimal averaging period in terms of convergence and communication cost is not a constant, but instead varies over the course of the execution. Specifically, we observe that reducing the variance of model parameters among the computing nodes is critical to the convergence of periodic parameter averaging SGD. Given a fixed communication budget, we show that it is more beneficial to synchronize more frequently in early iterations to reduce the initial large variance and synchronize less frequently in the later phase of the training process. We propose a practical algorithm, named ADaptive Periodic parameter averaging SGD (ADPSGD), to achieve a smaller overall variance of model parameters, and thus better convergence compared with the Constant Periodic parameter averaging SGD (CPSGD). We evaluate our method with several image classification benchmarks and show that our ADPSGD indeed achieves smaller training losses and higher test accuracies with less communication than CPSGD. Compared with gradient-quantization SGD, we show that our algorithm achieves faster convergence with only half of the communication. Compared with full-communication SGD, our ADPSGD achieves 1.14x to 1.27x speedups with a 100Gbps connection among computing nodes, and the speedups increase to 1.46x to 1.95x with a 10Gbps connection.
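The difference between a constant and a varying averaging period can be sketched on a toy problem. The code below is a minimal illustration only: the geometric growth rule in `sync_steps`, the function `run_workers`, and the one-dimensional quadratic objective are all hypothetical choices made for this sketch, and they are not the ADPSGD rule from the paper.

```python
import random


def sync_steps(total_steps, first_period, growth):
    """Iterations at which workers average their parameters.

    growth = 1.0 gives a constant period; growth > 1.0 synchronizes
    frequently at first and less often later (an assumed schedule).
    """
    steps, t, period = [], 0, float(first_period)
    while True:
        t += max(1, round(period))
        if t > total_steps:
            break
        steps.append(t)
        period *= growth
    return steps


def run_workers(total_steps, sync_at, workers=4, lr=0.1, noise=1.0, seed=0):
    """Local SGD on f(x) = x^2 / 2 with noisy gradients and periodic averaging."""
    rng = random.Random(seed)
    sync_at = set(sync_at)
    x = [5.0] * workers  # every worker starts from the same point
    for t in range(1, total_steps + 1):
        # Each worker takes one local step with its own gradient noise.
        x = [xi - lr * (xi + rng.gauss(0.0, noise)) for xi in x]
        if t in sync_at:
            mean = sum(x) / workers
            x = [mean] * workers  # parameter averaging
    return x


constant = sync_steps(20, 2, 1.0)  # every 2 iterations
growing = sync_steps(20, 1, 2.0)   # periods 1, 2, 4, 8
print(constant)
print(growing)
print(run_workers(20, growing))
```

Here `sync_steps(20, 1, 2.0)` returns `[1, 3, 7, 15]`, i.e. four synchronizations concentrated early in training, compared with ten evenly spaced ones for the constant schedule.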
 [112] arXiv:2007.06140 (cross-list from cs.LG) [pdf, other]

Title: Projected Latent Markov Chain Monte Carlo: Conditional Inference with Normalizing Flows
Comments: 21 pages, 12 figures, 4 tables
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce Projected Latent Markov Chain Monte Carlo (PLMCMC), a technique for sampling from the high-dimensional conditional distributions learned by a normalizing flow. We prove that PLMCMC asymptotically samples from the exact conditional distributions associated with a normalizing flow. As a conditional sampling method, PLMCMC enables Monte Carlo Expectation Maximization (MCEM) training of normalizing flows from incomplete data. By providing experimental results for a variety of data sets, we demonstrate the practicality and effectiveness of PLMCMC for missing data inference using normalizing flows.
 [113] arXiv:2007.06157 (cross-list from cs.LG) [pdf, other]

Title: Implementing the ICE Estimator in Multilayer Perceptron Classifiers
Authors: Tyler Ward
Subjects: Machine Learning (cs.LG); Computation (stat.CO)
This paper describes the techniques used to implement the ICE estimator for a multilayer perceptron model, and reviews the performance of the resulting models. The ICE estimator is implemented in the Apache Spark MultilayerPerceptronClassifier, and shown in cross-validation to outperform the stock MultilayerPerceptronClassifier that uses unadjusted MLE (cross-entropy) loss. The resulting models have identical runtime performance, and similar fitting performance to the stock MLP implementations. Additionally, this approach requires no hyperparameters, and is therefore viable as a drop-in replacement for cross-entropy optimizing multilayer perceptron classifiers wherever overfitting may be a concern.
 [114] arXiv:2007.06159 (cross-list from cs.LG) [pdf, other]

Title: Implicit Distributional Reinforcement Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
To improve the sample efficiency of policy-gradient based reinforcement learning algorithms, we propose implicit distributional actor critic (IDAC) that consists of a distributional critic, built on two deep generator networks (DGNs), and a semi-implicit actor (SIA), powered by a flexible policy distribution. We adopt a distributional perspective on the discounted cumulative return and model it with a state-action-dependent implicit distribution, which is approximated by the DGNs that take state-action pairs and random noises as their input. Moreover, we use the SIA to provide a semi-implicit policy distribution, which mixes the policy parameters with a reparameterizable distribution that is not constrained by an analytic density function. In this way, the policy's marginal distribution is implicit, providing the potential to model complex properties such as covariance structure and skewness, but its parameter and entropy can still be estimated. We incorporate these features with an off-policy algorithm framework to solve problems with continuous action space, and compare IDAC with the state-of-the-art algorithms on representative OpenAI Gym environments. We observe that IDAC outperforms these baselines for most tasks.
 [115] arXiv:2007.06168 (cross-list from cs.LG) [pdf, other]

Title: Model Fusion with Kullback-Leibler Divergence
Comments: ICML 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose a method to fuse posterior distributions learned from heterogeneous datasets. Our algorithm relies on a mean field assumption for both the fused model and the individual dataset posteriors and proceeds using a simple assign-and-average approach. The components of the dataset posteriors are assigned to the proposed global model components by solving a regularized variant of the assignment problem. The global components are then updated based on these assignments by their mean under a KL divergence. For exponential family variational distributions, our formulation leads to an efficient non-parametric algorithm for computing the fused model. Our algorithm is easy to describe and implement, efficient, and competitive with state-of-the-art on motion capture analysis, topic modeling, and federated learning of Bayesian neural networks.
 [116] arXiv:2007.06169 (cross-list from econ.EM) [pdf, other]

Title: An Adversarial Approach to Structural Estimation
Comments: 58 pages, 3 tables, 4 figures
Subjects: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
We propose a new simulation-based estimation method, adversarial estimation, for structural models. The estimator is formulated as the solution to a minimax problem between a generator (which generates synthetic observations using the structural model) and a discriminator (which classifies if an observation is synthetic). The discriminator maximizes the accuracy of its classification while the generator minimizes it. We show that, with a sufficiently rich discriminator, the adversarial estimator attains parametric efficiency under correct specification and the parametric rate under misspecification. We advocate the use of a neural network as a discriminator that can exploit adaptivity properties and attain fast rates of convergence. We apply our method to the elderly's saving decision model and show that including gender and health profiles in the discriminator uncovers the bequest motive as an important source of saving across the wealth distribution, not only for the rich.
 [117] arXiv:2007.06184 (cross-list from cs.LG) [pdf, other]

Title: Efficient Planning in Large MDPs with Weak Linear Function Approximation
Comments: 12 pages and appendix (10 pages). Submitted to the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Large-scale Markov decision processes (MDPs) require planning algorithms with runtime independent of the number of states of the MDP. We consider the planning problem in MDPs using linear value function approximation with only weak requirements: low approximation error for the optimal value function, and a small set of "core" states whose features span those of other states. In particular, we make no assumptions about the representability of policies or value functions of non-optimal policies. Our algorithm produces almost-optimal actions for any state using a generative oracle (simulator) for the MDP, while its computation time scales polynomially with the number of features, core states, and actions and the effective horizon.
 [118] arXiv:2007.06192 (cross-list from cs.LG) [pdf, other]

Title: Probabilistic bounds on data sensitivity in deep rectifier networks
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Neuron death is a complex phenomenon with implications for model trainability, but until recently it was measured only empirically. Recent articles have claimed that, as the depth of a rectifier neural network grows to infinity, the probability of finding a valid initialization decreases to zero. In this work, we provide a simple and rigorous proof of that result. Then, we show what happens when the width of each layer grows simultaneously with the depth. We derive both upper and lower bounds on the probability that a ReLU network is initialized to a trainable point, as a function of model hyperparameters. Contrary to previous claims, we show that it is possible to increase the depth of a network indefinitely, so long as the width increases as well. Furthermore, our bounds are asymptotically tight under reasonable assumptions: first, the upper bound coincides with the true probability for a single-layer network with the largest possible input set. Second, the true probability converges to our lower bound when the network width and depth both grow without limit. Our proof is based on the striking observation that very deep rectifier networks concentrate all outputs towards a single eigenvalue, in the sense that their normalized output variance goes to zero regardless of the network width. Finally, we develop a practical sign flipping scheme which guarantees with probability one that for a $k$-layer network, the ratio of living training data points is at least $2^{-k}$. We confirm our results with numerical simulations, suggesting that the actual improvement far exceeds the theoretical minimum. We also discuss how neuron death provides a theoretical interpretation for various network design choices such as batch normalization, residual layers and skip connections, and could inform the design of very deep neural networks.
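The interplay between depth and width can be illustrated with a small Monte Carlo experiment. The sketch below makes simplifying assumptions that are not taken from the paper: a single fixed nonzero input, zero biases, and i.i.d. standard Gaussian weights. Under those assumptions each neuron fires with probability 1/2 independently, so one input survives `depth` layers of width `width` with probability $(1 - 2^{-\text{width}})^{\text{depth}}$. The names `alive_after`, `estimate_alive`, and `predicted_alive` are hypothetical.

```python
import random


def alive_after(depth, width, rng, x):
    """True if the input still produces a nonzero activation after every layer."""
    h = x
    for _ in range(depth):
        # One random zero-bias layer: fresh Gaussian weights, then ReLU.
        h = [max(0.0, sum(rng.gauss(0.0, 1.0) * hi for hi in h))
             for _ in range(width)]
        if not any(h):  # every neuron in this layer is dead for this input
            return False
    return True


def estimate_alive(depth, width, trials=4000, seed=0):
    """Monte Carlo estimate of the survival probability of one input."""
    rng = random.Random(seed)
    x = [1.0, -0.5, 2.0]  # arbitrary nonzero input
    hits = sum(alive_after(depth, width, rng, x) for _ in range(trials))
    return hits / trials


def predicted_alive(depth, width):
    """Closed form under the zero-bias, symmetric-weight assumption."""
    return (1.0 - 2.0 ** (-width)) ** depth


for width in (2, 4, 8):
    print(width, estimate_alive(5, width), predicted_alive(5, width))
```

In this toy setting the survival probability decays geometrically in depth for a fixed width, but stays close to one when the width is increased, which matches the qualitative message of the abstract.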
 [119] arXiv:2007.06207 (cross-list from cs.LG) [pdf, other]

Title: DinerDash Gym: A Benchmark for Policy Learning in High-Dimensional Action Space
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
It has been arduous to assess the progress of policy learning algorithms on hierarchical tasks with high-dimensional action spaces due to the lack of a commonly accepted benchmark. In this work, we propose a new lightweight benchmark task called Diner Dash for evaluating performance in a complicated task with a high-dimensional action space. In contrast to the traditional Atari games, which only have a flat structure of goals and very few actions, the proposed benchmark task has a hierarchical task structure and an action space of size 57, and hence can facilitate the development of policy learning in complicated tasks. On top of that, we introduce Decomposed Policy Graph Modelling (DPGM), an algorithm that combines graph modelling and deep learning to allow explicit domain knowledge embedding and achieves significant improvement compared to the baseline. In the experiments, we show the effectiveness of the domain knowledge injection via a specially designed imitation algorithm, as well as results of other popular algorithms.
 [120] arXiv:2007.06225 (cross-list from cs.LG) [pdf]

Title: ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing
Authors: Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Debsindhu Bhowmik, Burkhard Rost
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Motivation: NLP continues improving substantially through auto-regressive and auto-encoding Language Models. These LMs require expensive computing resources for self-supervised or un-supervised learning from huge unlabelled text corpora. The information learned is transferred through so-called embeddings to downstream prediction tasks. Bioinformatics provide vast gold-mines of structured and sequentially ordered text data leading to extraordinarily successful protein sequence LMs that promise new frontiers for generative and predictive tasks at low inference cost. Here, we addressed two questions: (1) To which extent can HPC up-scale protein LMs to larger databases and larger models? (2) To which extent can LMs extract features from single proteins to get closer to the performance of methods using evolutionary information?
Methodology: Here, we trained two auto-regressive language models (Transformer-XL and XLNet) and two auto-encoder models (BERT and Albert) using 80 billion amino acids from 200 million protein sequences (UniRef100) and 393 billion amino acids from 2.1 billion protein sequences (BFD). The LMs were trained on the Summit supercomputer, using 5616 GPUs and one TPU Pod, using V3-512 cores.
Results: The results of training these LMs on proteins were assessed by predicting secondary structure in three and eight states (Q3=75-83, Q8=63-72), localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabelled data (only protein sequences) captured important biophysical properties of the protein alphabet, namely the amino acids, and their well orchestrated interplay in governing the shape of proteins. In the analogy of NLP, this implied having learned some of the grammar of the language of life realized in protein sequences.
 [121] arXiv:2007.06226 (cross-list from cs.LG) [pdf, other]

Title: Neural Network Verification through Replication
Comments: 13 pages, 13 figures
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
A system identification based approach to neural network model replication is presented and the application of model replication to verification of fundamental, single hidden layer, neural network systems is demonstrated. The presented approach serves as a means to partially address the problem of verifying that a neural network implementation meets a provided specification given only grey-box access to the implemented network. The procedure developed involves stimulating a neural network with a chosen signal, extracting a replicated model from the response, and systematically checking that the replicated model is output-equivalent to a specified model in order to verify that the grey-box system under test is implemented to specification without direct access to its hidden parameters. The replication step is introduced to provide an inherent guarantee that the stimulus signals employed yield sufficient test coverage. This method is investigated as a neural network focused nonlinear counterpart to the traditional verification of circuits through system identification. A strategy for choosing the stimulus is provided and an algorithm for verifying that the resulting response is indicative of a specification-compliant neural network system under test is derived. We find that the method can reliably detect defects in small neural networks or in small sub-circuits within larger neural networks.
 [122] arXiv:2007.06229 (cross-list from cs.LG) [pdf, other]

Title: Deep Claim: Payer Response Prediction from Claims Data with Deep Learning
Comments: To be presented at the Healthcare Systems, Population Health, and the Role of HealthTech (HSYS) Workshop at the 37th International Conference on Machine Learning, Vienna, Austria, July 13-18, 2020
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
Each year, almost 10% of claims are denied by payers (i.e., health insurance plans). With the cost to recover these denials and underpayments, predicting payer response (likelihood of payment) from claims data with a high degree of accuracy and precision is anticipated to improve healthcare staffs' performance productivity and drive better patient financial experience and satisfaction in the revenue cycle (Barkholz, 2017). However, constructing advanced predictive analytics models has been considered challenging in the last twenty years. That said, we propose a (low-level) context-dependent compact representation of patients' historical claim records by effectively learning complicated dependencies in the (high-level) claim inputs. Built on this new latent representation, we demonstrate that a deep learning-based framework, Deep Claim, can accurately predict various responses from multiple payers using 2,905,026 de-identified claims data from two US health systems. Deep Claim's improvements over carefully chosen baselines in predicting claim denials are most pronounced as 22.21% relative recall gain (at 95% precision) on Health System A, which implies Deep Claim can find 22.21% more denials than the best baseline system.
 [123] arXiv:2007.06230 (cross-list from cs.LG) [pdf, other]

Title: Using LSTM for the Prediction of Disruption in ADITYA Tokamak
Comments: 7 pages, 4 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Major disruptions in a tokamak pose a serious threat to the vessel and its surrounding equipment. The ability to detect behavior that can lead to a disruption helps alert the system beforehand and prevent harmful effects. Many machine learning techniques are already in use at large tokamaks like JET and ASDEX, but are not suitable for ADITYA, which is comparatively small. In this work, we discuss a new real-time approach to predict the time of disruption in the ADITYA tokamak and validate the results on an experimental dataset. The system takes selected diagnostics from the tokamak and, after some pre-processing steps, sends them to a time-sequence Long Short-Term Memory (LSTM) network. The model can make predictions 12 ms in advance at a low computational cost, quickly enough to be deployed in real-time applications.
 [124] arXiv:2007.06236 (cross-list from cs.LG) [pdf, other]

Title: The Good, The Bad, and The Ugly: Quality Inference in Federated Learning
Authors: Balázs Pejó
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Collaborative machine learning algorithms are developed both for efficiency reasons and to ensure the privacy protection of sensitive data used for processing. Federated learning is the most popular of these methods, where 1) learning is done locally, and 2) only a subset of the participants contribute in each training round. Although no data is shared explicitly, recent studies showed that models trained with FL could potentially still leak some information. In this paper we focus on the quality property of the datasets and investigate whether the leaked information could be connected to specific participants. Via a differential attack we analyze the information leakage using a few simple metrics, and show that reconstruction of the quality ordering among the training participants' datasets is possible. Our scoring rules use only oracle access to a test dataset and no further background information or computational power. We demonstrate two implications of such a quality ordering leakage: 1) we use it to increase the accuracy of the model by weighting the participants' updates, and 2) we use it to detect misbehaving participants.
 [125] arXiv:2007.06240 (cross-list from cs.CV) [pdf, other]

Title: Expert Training: Task Hardness Aware Meta-Learning for Few-Shot Classification
Comments: 9 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Deep neural networks are highly effective when a large number of labeled samples are available but fail with few-shot classification tasks. Recently, meta-learning methods have received much attention, which train a meta-learner on massive additional tasks to gain the knowledge to instruct the few-shot classification. Usually, the training tasks are randomly sampled and performed indiscriminately, often leaving the meta-learner stuck in a bad local optimum. Some works in the optimization of deep neural networks have shown that a better arrangement of training data can make the classifier converge faster and perform better. Inspired by this idea, we propose an easy-to-hard expert meta-training strategy to arrange the training tasks properly, where easy tasks are preferred in the first phase and hard tasks are emphasized in the second phase. A task hardness aware module is designed and integrated into the training procedure to estimate the hardness of a task based on the distinguishability of its categories. In addition, we explore multiple hardness measurements including the semantic relation, the pairwise Euclidean distance, the Hausdorff distance, and the Hilbert-Schmidt independence criterion. Experimental results on the miniImageNet and tieredImageNetSketch datasets show that the meta-learners can obtain better results with our expert training strategy.
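An easy-to-hard ordering driven by a distance-based hardness score could look roughly like the sketch below. It is only an assumed instance of the pairwise Euclidean distance measurement named in the abstract: the score (negative mean distance between class centroids) and the names `task_hardness` and `easy_to_hard` are hypothetical, and the paper's actual module may differ.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def task_hardness(task):
    """Hardness of a task given as {class label: [feature vectors]}.

    Assumption: classes whose centroids lie close together are harder to
    tell apart, so the score is the negative mean pairwise centroid distance.
    """
    centroids = [centroid(points) for points in task.values()]
    dists = [math.dist(a, b) for a, b in combinations(centroids, 2)]
    return -sum(dists) / len(dists)


def easy_to_hard(tasks):
    """Order tasks from least to most hard."""
    return sorted(tasks, key=task_hardness)


# Two toy 2-way tasks with 2-D features (made-up data).
easy = {"a": [[0.0, 0.0], [0.0, 2.0]], "b": [[10.0, 0.0], [10.0, 2.0]]}
hard = {"a": [[0.0, 0.0], [0.0, 2.0]], "b": [[1.0, 0.0], [1.0, 2.0]]}
ordered = easy_to_hard([hard, easy])
print([task_hardness(t) for t in ordered])  # [-10.0, -1.0]
```

A two-phase curriculum in the spirit of the abstract would then draw mostly from the front of `ordered` early in training and shift towards the back later on.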
 [126] arXiv:2007.06245 (cross-list from cs.LG) [pdf, other]

Title: Reconstruction Bottlenecks in Object-Centric Generative Models
Comments: 10 pages, 7 Figures, Workshop on Object-Oriented Learning at ICML 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
A range of methods with suitable inductive biases exist to learn interpretable object-centric representations of images without supervision. However, these are largely restricted to visually simple images; robust object discovery in real-world sensory datasets remains elusive. To increase the understanding of such inductive biases, we empirically investigate the role of "reconstruction bottlenecks" for scene decomposition in GENESIS, a recent VAE-based model. We show such bottlenecks determine reconstruction and segmentation quality and critically influence model behaviour.
 [127] arXiv:2007.06252 (cross-list from cs.LG) [pdf, other]

Title: ProteiNN: Intrinsic-Extrinsic Convolution and Pooling for Scalable Deep Protein Analysis
Authors: Pedro Hermosilla, Marco Schäfer, Matěj Lang, Gloria Fackelmann, Pere Pau Vázquez, Barbora Kozlíková, Michael Krone, Tobias Ritschel, Timo Ropinski
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
Proteins perform a large variety of functions in living organisms, thus playing a key role in biology. As of now, available learning algorithms to process protein data do not consider several particularities of such data and/or do not scale well for large protein conformations. To fill this gap, we propose two new learning operations enabling deep 3D analysis of large-scale protein data. First, we introduce a novel convolution operator which considers both the intrinsic (invariant under protein folding) and the extrinsic (invariant under bonding) structure, by using $n$-D convolutions defined on both the Euclidean distance and multiple geodesic distances between atoms in a multi-graph. Second, we enable a multi-scale protein analysis by introducing hierarchical pooling operators, exploiting the fact that proteins are a recombination of a finite set of amino acids, which can be pooled using shared pooling matrices. Lastly, we evaluate the accuracy of our algorithms on several large-scale data sets for common protein analysis tasks, where we outperform state-of-the-art methods.
 [128] arXiv:2007.06281 (cross-list from cs.LG) [pdf, other]

Title: Distributed Graph Convolutional Networks
Comments: Preprint submitted to IEEE TSIPN
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
The aim of this work is to develop a fully-distributed algorithmic framework for training graph convolutional networks (GCNs). The proposed method is able to exploit the meaningful relational structure of the input data, which are collected by a set of agents that communicate over a sparse network topology. After formulating the centralized GCN training problem, we first show how to perform inference in a distributed scenario where the underlying data graph is split among different agents. Then, we propose a distributed gradient descent procedure to solve the GCN training problem. The resulting model distributes computation along three lines: during inference, during back-propagation, and during optimization. Convergence to stationary solutions of the GCN training problem is also established under mild conditions. Finally, we propose an optimization criterion to design the communication topology between agents so that it matches the graph describing data relationships. A wide set of numerical results validates our proposal. To the best of our knowledge, this is the first work combining graph convolutional neural networks with distributed optimization.
 [129] arXiv:2007.06324 (cross-list from cs.LG) [pdf, other]

Title: TrustNet: Learning from Trusted Data Against (A)symmetric Label Noise
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Robustness to label noise is a critical property for weakly-supervised classifiers trained on massive datasets. In this paper, we first derive an analytical bound for any given noise pattern. Based on the insights, we design TrustNet, which first adversely learns the pattern of noise corruption, be it symmetric or asymmetric, from a small set of trusted data. Then, TrustNet is trained via a robust loss function, which weights the given labels against the inferred labels from the learned noise pattern. The weight is adjusted based on model uncertainty across training epochs. We evaluate TrustNet on synthetic label noise for CIFAR-10 and CIFAR-100, and on real-world data with label noise, i.e., Clothing1M. We compare against state-of-the-art methods, demonstrating the strong robustness of TrustNet under a diverse set of noise patterns.
 [130] arXiv:2007.06346 (cross-list from cs.LG) [pdf, other]

Title: Whitening for Self-Supervised Representation Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Recent literature on self-supervised learning is based on the contrastive loss, where image instances which share the same semantic content ("positives") are contrasted with instances extracted from other images ("negatives"). However, in order for the learning to be effective, a lot of negatives should be compared with a positive pair. This is not only computationally demanding, but it also requires that the positive and the negative representations are kept consistent with each other over a long training period. In this paper we propose a different direction and a new loss function for self-supervised learning which is based on the whitening of the latent-space features. The whitening operation has a "scattering" effect on the batch samples, which compensates for the lack of a large number of negatives, avoiding degenerate solutions where all the sample representations collapse to a single point. We empirically show that our loss accelerates self-supervised training and the learned representations are much more effective for downstream tasks than previously published work.
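As a rough illustration of the whitening operation mentioned in the abstract above (not the authors' implementation or their loss), the following Python sketch applies ZCA whitening to a batch of feature vectors with NumPy. The function name `whiten`, the `eps` stabilizer, and the synthetic batch are assumptions made for this example; the paper may use a different whitening transform.

```python
import numpy as np


def whiten(features, eps=1e-5):
    """ZCA-whiten a batch of feature vectors (rows are samples).

    Hypothetical helper for illustration only: it centers the batch and
    rescales it so that the sample covariance is close to the identity.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Unbiased sample covariance of the centered batch.
    cov = centered.T @ centered / (features.shape[0] - 1)
    # Symmetric eigendecomposition; eps guards against tiny eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    transform = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ transform


# Synthetic, correlated "latent features" (assumed data, not from the paper).
rng = np.random.default_rng(0)
mix = np.triu(np.ones((8, 8)))          # fixed, well-conditioned mixing matrix
batch = rng.normal(size=(512, 8)) @ mix  # correlated feature dimensions
white = whiten(batch)
cov_white = np.cov(white, rowvar=False)  # approximately the identity matrix
```

After whitening, the batch has zero mean and near-identity covariance, so the samples cannot all collapse to a single point, which is the "scattering" effect the abstract refers to.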
 [131] arXiv:2007.06368 (cross-list from cs.LG) [pdf, other]

Title: Contextual Bandit with Missing Rewards
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We consider a novel variant of the contextual bandit problem (i.e., the multi-armed bandit with side-information, or context, available to a decision-maker) where the reward associated with each context-based decision may not always be observed ("missing rewards"). This new problem is motivated by certain online settings including clinical trial and ad recommendation applications. In order to address the missing rewards setting, we propose to combine the standard contextual bandit approach with an unsupervised learning mechanism such as clustering. Unlike standard contextual bandit methods, by leveraging clustering to estimate missing rewards, we are able to learn from each incoming event, even those with missing rewards. Promising empirical results are obtained on several real-life datasets.
 [132] arXiv:2007.06379 (cross-list from cs.LG) [pdf, other]

Title: Rule Covering for Interpretation and Boosting
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose two algorithms for interpretation and boosting of tree-based ensemble methods. Both algorithms make use of mathematical programming models that are constructed with a set of rules extracted from an ensemble of decision trees. The objective is to obtain the minimum total impurity with the least number of rules that cover all the samples. The first algorithm uses the collection of decision trees obtained from a trained random forest model. Our numerical results show that the proposed rule covering approach selects only a few rules that could be used for interpreting the random forest model. Moreover, the resulting set of rules closely matches the accuracy level of the random forest model. Inspired by the column generation algorithm in linear programming, our second algorithm uses a rule generation scheme for boosting decision trees. We use the dual optimal solutions of the linear programming models as sample weights to obtain only those rules that would improve the accuracy. With a computational study, we observe that our second algorithm performs competitively with the other well-known boosting methods. Our implementations also demonstrate that both algorithms can be trivially coupled with the existing random forest and decision tree packages.
 [133] arXiv:2007.06381 (cross-list from cs.LG) [pdf, other]

Title: A simple defense against adversarial attacks on heatmap explanations
Comments: Accepted at 2020 Workshop on Human Interpretability in Machine Learning (WHI)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
With machine learning models being used for more sensitive applications, we rely on interpretability methods to prove that no discriminating attributes were used for classification. A potential concern is the so-called "fairwashing": manipulating a model such that the features used in reality are hidden and more innocuous features are shown to be important instead.
In our work we present an effective defence against such adversarial attacks on neural networks. By a simple aggregation of multiple explanation methods, the network becomes robust against manipulation. This holds even when the attacker has exact knowledge of the model weights and the explanation methods used.
 [134] arXiv:2007.06402 (cross-list from cs.CV) [pdf, other]

Title: Nested Learning For Multi-Granular Tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Standard deep neural networks (DNNs) are commonly trained in an end-to-end fashion for specific tasks such as object recognition, face identification, or character recognition, among many examples. This specificity often leads to overconfident models that generalize poorly to samples that are not from the original training distribution. Moreover, such standard DNNs do not allow leveraging information from heterogeneously annotated training data, where, for example, labels may be provided with different levels of granularity. Furthermore, DNNs do not produce results with simultaneous different levels of confidence for different levels of detail; they are most commonly an all-or-nothing approach. To address these challenges, we introduce the concept of nested learning: how to obtain a hierarchical representation of the input such that a coarse label can be extracted first, and sequentially refine this representation, if the sample permits, to obtain successively refined predictions, all of them with the corresponding confidence. We explicitly enforce this behavior by creating a sequence of nested information bottlenecks. Looking at the problem of nested learning from an information theory perspective, we design a network topology with two important properties. First, a sequence of low-dimensional (nested) feature embeddings is enforced. Then we show how the explicit combination of nested outputs can improve both the robustness and the accuracy of finer predictions. Experimental results on Cifar-10, Cifar-100, MNIST, Fashion-MNIST, Dbpedia, and Plantvillage demonstrate that nested learning outperforms the same network trained in the standard end-to-end fashion.
 [135] arXiv:2007.06414 (cross-list from q-bio.PE) [pdf, other]

Title: Epidemic modelling of bovine tuberculosis in cattle herds and badgers in Ireland
Comments: 32 pages, 2 figures
Subjects: Populations and Evolution (q-bio.PE); Applications (stat.AP)
Bovine tuberculosis, a disease that affects cattle and badgers in Ireland, was studied via stochastic epidemic modeling using incidence data from the Four Area Project (Griffin et al., 2005). The Four Area Project was a large-scale field trial conducted in four diverse farming regions of Ireland over a five-year period (1997-2002) to evaluate the impact of badger culling on bovine tuberculosis incidence in cattle herds.
Based on the comparison of several models, the model with no between-herd transmission and badger-to-herd transmission proportional to the total number of infected badgers culled was best supported by the data.
Detailed model validation was conducted via model prediction, identifiability checks and sensitivity analysis.
The results suggest that badger-to-cattle transmission is of more importance than between-herd transmission and that if there were no badger-to-herd transmission, levels of bovine tuberculosis in cattle herds in Ireland could decrease considerably.
 [136] arXiv:2007.06418 (cross-list from cs.LG) [pdf, other]

Title: Lessons Learned from the Training of GANs on Artificial Datasets
Authors: Shichang Tang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Generative Adversarial Networks (GANs) have made great progress in synthesizing realistic images in recent years. However, they are often trained on image datasets with either too few samples or too many classes belonging to different data distributions. Consequently, GANs are prone to underfitting or overfitting, making the analysis of them difficult and constrained. Therefore, in order to conduct a thorough study on GANs while obviating unnecessary interferences introduced by the datasets, we train them on artificial datasets where there are infinitely many samples and the real data distributions are simple, high-dimensional and have structured manifolds. Moreover, the generators are designed such that optimal sets of parameters exist. Empirically, we find that under various distance measures, the generator fails to learn such parameters with the GAN training procedure. We also find that training mixtures of GANs leads to a larger performance gain than increasing the network depth or width when the model complexity is high enough. Our experimental results demonstrate that a mixture of generators can discover different modes or different classes automatically in an unsupervised setting, which we attribute to the distribution of the generation and discrimination tasks across multiple generators and discriminators. As an example of the generalizability of our conclusions to realistic datasets, we train a mixture of GANs on the CIFAR-10 dataset and our method significantly outperforms the state-of-the-art in terms of popular metrics, i.e., Inception Score (IS) and Fréchet Inception Distance (FID).
 [137] arXiv:2007.06437 (cross-list from cs.LG) [pdf, other]

Title: A Provably Efficient Sample Collection Strategy for Reinforcement Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
A common assumption in reinforcement learning (RL) is to have access to a generative model (i.e., a simulator of the environment), which allows generating samples from any desired state-action pair. Nonetheless, in many settings a generative model may not be available and an adaptive exploration strategy is needed to efficiently collect samples from an unknown environment by direct interaction. In this paper, we study the scenario where an algorithm based on the generative model assumption defines the (possibly time-varying) number of samples $b(s,a)$ required at each state-action pair $(s,a)$ and an exploration strategy has to learn how to generate $b(s,a)$ samples as fast as possible. Building on recent results for regret minimization in the stochastic shortest path (SSP) setting (Cohen et al., 2020; Tarbouriech et al., 2020), we derive an algorithm that requires $\tilde{O}( B D + D^{3/2} S^2 A)$ time steps to collect the $B = \sum_{s,a} b(s,a)$ desired samples, in any unknown and communicating MDP with $S$ states, $A$ actions and diameter $D$. Leveraging the generality of our strategy, we readily apply it to a variety of existing settings (e.g., model estimation, pure exploration in MDPs) for which we obtain improved sample-complexity guarantees, and to a set of new problems such as best-state identification and sparse reward discovery.
 [138] arXiv:2007.06503 (cross-list from cs.LG) [pdf, other]

Title: PRI-VAE: Principle-of-Relevant-Information Variational Autoencoders
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Although substantial efforts have been made to learn disentangled representations under the variational autoencoder (VAE) framework, the fundamental properties of the learning dynamics of most VAE models still remain unknown and under-investigated. In this work, we first propose a novel learning objective, termed the principle-of-relevant-information variational autoencoder (PRI-VAE), to learn disentangled representations. We then present an information-theoretic perspective to analyze existing VAE models by inspecting the evolution of some critical information-theoretic quantities across training epochs. Our observations unveil some fundamental properties associated with VAEs. Empirical results also demonstrate the effectiveness of PRI-VAE on four benchmark data sets.
 [139] arXiv:2007.06528 (cross-list from math.OC) [pdf, other]

Title: Random extrapolation for primal-dual coordinate descent
Comments: To appear in ICML 2020
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce a randomly extrapolated primal-dual coordinate descent method that adapts to sparsity of the data matrix and the favorable structures of the objective function. Our method updates only a subset of primal and dual variables with sparse data, and it uses large step sizes with dense data, retaining the benefits of the specific methods designed for each case. In addition to adapting to sparsity, our method attains fast convergence guarantees in favorable cases without any modifications. In particular, we prove linear convergence under metric subregularity, which applies to strongly convex-strongly concave problems and piecewise linear quadratic functions. We show almost sure convergence of the sequence and optimal sublinear convergence rates for the primal-dual gap and objective values, in the general convex-concave case. Numerical evidence demonstrates the state-of-the-art empirical performance of our method in sparse and dense settings, matching and improving the existing methods.
 [140] arXiv:2007.06533 (cross-list from cs.LG) [pdf, other]

Title: S2RMs: Spatially Structured Recurrent Modules
Authors: Nasim Rahaman, Anirudh Goyal, Muhammad Waleed Gondal, Manuel Wuthrich, Stefan Bauer, Yash Sharma, Yoshua Bengio, Bernhard Schölkopf
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Capturing the structure of a data-generating process by means of appropriate inductive biases can help in learning models that generalize well and are robust to changes in the input distribution. While methods that harness spatial and temporal structures find broad application, recent work has demonstrated the potential of models that leverage sparse and modular structure using an ensemble of sparingly interacting modules. In this work, we take a step towards dynamic models that are capable of simultaneously exploiting both modular and spatiotemporal structures. We accomplish this by abstracting the modeled dynamical system as a collection of autonomous but sparsely interacting sub-systems. The sub-systems interact according to a topology that is learned, but also informed by the spatial structure of the underlying real-world system. This results in a class of models that are well suited for modeling the dynamics of systems that only offer local views into their state, along with corresponding spatial locations of those views. On the tasks of video prediction from cropped frames and multi-agent world modeling from partial observations in the challenging Starcraft2 domain, we find our models to be more robust to the number of available views and better capable of generalization to novel tasks without additional training, even when compared against strong baselines that perform equally well or better on the training distribution.
 [141] arXiv:2007.06555 (cross-list from cs.LG) [pdf, other]

Title: Adversarial robustness via robust low rank representations
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Adversarial robustness measures the susceptibility of a classifier to imperceptible perturbations made to the inputs at test time. In this work we highlight the benefits of natural low rank representations that often exist for real data such as images, for training neural networks with certified robustness guarantees.
Our first contribution is for certified robustness to perturbations measured in $\ell_2$ norm. We exploit low rank data representations to provide improved guarantees over state-of-the-art randomized smoothing-based approaches on standard benchmark datasets such as CIFAR-10 and CIFAR-100.
Our second contribution is for the more challenging setting of certified robustness to perturbations measured in $\ell_\infty$ norm. We demonstrate empirically that natural low rank representations have inherent robustness properties that can be leveraged to provide significantly better guarantees for certified robustness to $\ell_\infty$ perturbations in those representations. Our certificate of $\ell_\infty$ robustness relies on a natural quantity involving the $\infty \to 2$ matrix operator norm associated with the representation, to translate robustness guarantees from $\ell_2$ to $\ell_\infty$ perturbations.
A key technical ingredient for our certification guarantees is a fast algorithm with provable guarantees based on the multiplicative weights update method to provide upper bounds on the above matrix norm. Our algorithmic guarantees improve upon the state of the art for this problem, and may be of independent interest.
 [142] arXiv:2007.06557 (cross-list from cs.SI) [pdf, other]

Title: Scalable Learning of Independent Cascade Dynamics from Partial Observations
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Physics and Society (physics.soc-ph); Machine Learning (stat.ML)
Spreading processes play an increasingly important role in modeling for diffusion networks, information propagation, marketing, and opinion setting. Recent real-world spreading events further highlight the need for prediction, optimization, and control of diffusion dynamics. To tackle these tasks, it is essential to learn the effective spreading model and transmission probabilities across the network of interactions. However, in most cases the transmission rates are unknown and need to be inferred from the spreading data. Additionally, full observation of the dynamics is rarely available. As a result, standard approaches such as maximum likelihood quickly become intractable for large network instances. In this work, we study the popular Independent Cascade model of stochastic diffusion dynamics. We introduce a computationally efficient algorithm, based on a scalable dynamic message-passing approach, which is able to learn parameters of the effective spreading model given only limited information on the activation times of nodes in the network. Importantly, we show that the resulting model approximates the marginal activation probabilities that can be used for prediction of the spread.
 [143] arXiv:2007.06559 (cross-list from cs.LG) [pdf, other]

Title: Graph Structure of Neural Networks
Comments: ICML 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Neural networks are often represented as graphs of connections between neurons. However, despite their wide use, there is currently little understanding of the relationship between the graph structure of the neural network and its predictive performance. Here we systematically investigate how the graph structure of neural networks affects their predictive performance. To this end, we develop a novel graph-based representation of neural networks called a relational graph, where layers of neural network computation correspond to rounds of message exchange along the graph structure. Using this representation we show that: (1) a "sweet spot" of relational graphs leads to neural networks with significantly improved predictive performance; (2) a neural network's performance is approximately a smooth function of the clustering coefficient and average path length of its relational graph; (3) our findings are consistent across many different tasks and datasets; (4) the sweet spot can be identified efficiently; (5) top-performing neural networks have graph structure surprisingly similar to those of real biological neural networks. Our work opens new directions for the design of neural architectures and the understanding of neural networks in general.
Replacements for Tue, 14 Jul 20
 [144] arXiv:1712.07248 (replaced) [pdf, ps, other]

Title: Towards a General Large Sample Theory for Regularized Estimators
Subjects: Statistics Theory (math.ST); Econometrics (econ.EM)
 [145] arXiv:1802.08667 (replaced) [pdf, ps, other]

Title: De-Biased Machine Learning of Global and Local Parameters Using Regularized Riesz Representers
Comments: 41 pages; submitted version
Subjects: Machine Learning (stat.ML); Econometrics (econ.EM); Statistics Theory (math.ST)
 [146] arXiv:1807.04010 (replaced) [pdf, ps, other]

Title: Causal Discovery in the Presence of Missing Data
Authors: Ruibo Tu, Kun Zhang, Paul Ackermann, Bo Christer Bertilson, Clark Glymour, Hedvig Kjellström, Cheng Zhang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [147] arXiv:1808.08558 (replaced) [pdf, other]

Title: Spectral Pruning: Compressing Deep Neural Networks via Spectral Analysis and its Generalization Error
Authors: Taiji Suzuki, Hiroshi Abe, Tomoya Murata, Shingo Horiuchi, Kotaro Ito, Tokuma Wachi, So Hirai, Masatoshi Yukishima, Tomoaki Nishimura
Comments: 17 pages, 4 figures. Accepted in IJCAI-PRICAI 2020. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 2839-2846
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [148] arXiv:1809.05224 (replaced) [pdf, ps, other]

Title: Automatic Debiased Machine Learning of Causal and Structural Effects
Subjects: Statistics Theory (math.ST); Econometrics (econ.EM)
 [149] arXiv:1811.00401 (replaced) [pdf, other]

Title: Excessive Invariance Causes Adversarial Vulnerability
Journal-ref: Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [150] arXiv:1811.03064 (replaced) [pdf, other]

Title: Towards a Near Universal Time Series Data Mining Tool: Introducing the Matrix Profile
Authors: Chin-Chia Michael Yeh
Comments: PhD dissertation (2018)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [151] arXiv:1901.03904 (replaced) [pdf]

Title: A Speech Act Classifier for Persian Texts and its Application in Identifying Rumors
Comments: Published Link: this http URL
Journal-ref: Journal of Soft Computing and Information Technology, 9, 1, 1399 (2020), 18-27
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [152] arXiv:1902.01542 (replaced) [pdf, other]

Title: Learning Hierarchical Interactions at Scale: A Convex Optimization Approach
Comments: AISTATS 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO)
 [153] arXiv:1902.10459 (replaced) [pdf, other]

Title: Data segmentation based on the local intrinsic dimension
Comments: 11 pages, 6 figures + 9 pages Supplementary Information
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [154] arXiv:1903.02050 (replaced) [pdf, other]

Title: Revisiting the Evaluation of Uncertainty Estimation and Its Application to Explore Model Complexity-Uncertainty Trade-Off
Comments: CVPR 2020 - Fair, Data Efficient and Trusted Computer Vision Workshop
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [155] arXiv:1904.04276 (replaced) [pdf, other]

Title: On nearly assumption-free tests of nominal confidence interval coverage for causal parameters estimated by machine learning
Comments: Significant updates from the previous version. In press in Statistical Science
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
 [156] arXiv:1905.12813 (replaced) [pdf, other]

Title: Data-Dependent Differentially Private Parameter Learning for Directed Graphical Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [157] arXiv:1906.00042 (replaced) [pdf, other]

Title: Bayesian Profiling Multiple Imputation for Missing Electronic Health Records
Subjects: Methodology (stat.ME)
 [158] arXiv:1906.04538 (replaced) [pdf, other]

Title: Identification of taxon through fuzzy classification
Comments: About half are appendices, which contain mathematical details
Subjects: Applications (stat.AP); Methodology (stat.ME)
 [159] arXiv:1906.05363 (replaced) [pdf, other]

Title: Competing Bandits in Matching Markets
Comments: 15 pages, 3 figures. A version appears in the Proceedings of The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
 [160] arXiv:1907.00502 (replaced) [pdf, other]

Title: Wave-shape oscillatory model for nonstationary periodic time series analysis
Comments: 40 pages, 15 figures
Subjects: Applications (stat.AP)
 [161] arXiv:1907.04147 (replaced) [pdf, ps, other]

Title: Adaptive inference for a semiparametric generalized autoregressive conditional heteroskedasticity model
Subjects: Methodology (stat.ME); Econometrics (econ.EM)
 [162] arXiv:1907.06734 (replaced) [pdf]

Title: Mediation effects that emulate a target randomised trial: Simulation-based evaluation of ill-defined interventions on multiple mediators
Subjects: Methodology (stat.ME)
 [163] arXiv:1907.11546 (replaced) [pdf, other]

Title: Compressing deep quaternion neural networks with targeted regularization
Comments: Published on CAAI Transactions on Intelligence Technology, this https URL
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [164] arXiv:1908.03620 (replaced) [pdf, other]

Title: Learning physics-based reduced-order models for a single-injector combustion process
Journal-ref: AIAA Journal 58:6, 2658-2672, 2020
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS); Machine Learning (stat.ML)
 [165] arXiv:1909.02496 (replaced) [pdf, ps, other]

Title: The Benefits of Diversity: Permutation Recovery in Unlabeled Sensing from Multiple Measurement Vectors
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [166] arXiv:1909.06039 (replaced) [pdf, other]

Title: dblink: Distributed End-to-End Bayesian Entity Resolution
Authors: Neil G. Marchant, Andee Kaplan, Daniel N. Elazar, Benjamin I. P. Rubinstein, Rebecca C. Steorts
Comments: 30 pages, 6 figures, 4 tables. Includes 21 pages of supplementary material. This revision includes: updates to the related work, improvements to the clarity of writing and minor updates to the experimental results
Subjects: Computation (stat.CO); Databases (cs.DB); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [167] arXiv:1909.06389 (replaced) [pdf, other]

Title: Spectral Analysis Of Weighted Laplacians Arising In Data Clustering
Subjects: Spectral Theory (math.SP); Analysis of PDEs (math.AP); Machine Learning (stat.ML)
 [168] arXiv:1909.06677 (replaced) [pdf, other]

Title: Predictive Multiplicity in Classification
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
 [169] arXiv:1909.11062 (replaced) [pdf, other]

Title: Wavelet invariants for statistically robust multi-reference alignment
Comments: 59 pages, 8 figures. v3 replaces v2 and is an extensive revision. Revisions include additional background and motivation, additional context relating the approach to other methods, a discussion of stability, and improved presentation. Code reproducing all numerical results is available at this https URL
Subjects: Signal Processing (eess.SP); Statistics Theory (math.ST)
 [170] arXiv:1910.00270 (replaced) [pdf, other]

Title: Robust Learning with the Hilbert-Schmidt Independence Criterion
Comments: Proceedings of the 37th International Conference on Machine Learning (ICML 2020)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [171] arXiv:1910.02919 (replaced) [pdf, other]

Title: Multi-step Greedy Reinforcement Learning Algorithms
Comments: ICML 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [172] arXiv:1910.08442 (replaced) [pdf, ps, other]

Title: Center-Outward R-Estimation for Semiparametric VARMA Models
Comments: 55 pages, 16 figures, 3 tables
Subjects: Statistics Theory (math.ST)
 [173] arXiv:1910.12327 (replaced) [pdf, ps, other]

Title: A simple measure of conditional dependence
Comments: 35 pages, 2 tables. A section on interpreting the coefficient as a generalization of partial R^2 has been added. R package available at this https URL
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Probability (math.PR); Methodology (stat.ME)
 [174] arXiv:1911.00115 (replaced) [pdf, other]

Title: The consequences of checking for zero-inflation and overdispersion in the analysis of count data
Authors: Harlan Campbell
Comments: 30 pages, 17 figures
Subjects: Methodology (stat.ME)
 [175] arXiv:1911.02109 (replaced) [pdf, other]

Title: Deep least-squares methods: an unsupervised learning-based numerical method for solving elliptic PDEs
Comments: 15 pages, 6 figures, 5 tables, accepted by Journal of Computational Physics
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
 [176] arXiv:1911.02768 (replaced) [pdf, other]

Title: Confidence Intervals for Policy Evaluation in Adaptive Experiments
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
 [177] arXiv:1911.09721 (replaced) [pdf, other]

Title: Communication-Efficient and Byzantine-Robust Distributed Learning with Error Feedback
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
 [178] arXiv:1912.02279 (replaced) [pdf, other]

Title: Angular Visual Hardness
Authors: Beidi Chen, Weiyang Liu, Zhiding Yu, Jan Kautz, Anshumali Shrivastava, Animesh Garg, Anima Anandkumar
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [179] arXiv:1912.07458 (replaced) [pdf, other]

Title: On-manifold Adversarial Data Augmentation Improves Uncertainty Calibration
Comments: changes in appendix
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [180] arXiv:1912.08521 (replaced) [pdf, other]

Title: Semantically Plausible and Diverse 3D Human Motion Prediction
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [181] arXiv:1912.13053 (replaced) [pdf, other]

Title: Disentangling Trainability and Generalization in Deep Neural Networks
Comments: 22 pages, 3 figures, ICML 2020. Associated Colab notebook at this https URL
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [182] arXiv:1912.13119 (replaced) [pdf, other]

Title: Clustering and Prediction with Variable Dimension Covariates
Subjects: Methodology (stat.ME)
 [183] arXiv:2001.00102 (replaced) [pdf, other]

Title: The Gambler's Problem and Beyond
Comments: International Conference on Learning Representations (ICLR) 2020
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
 [184] arXiv:2001.03955 (replaced) [pdf, other]

Title: Aggregated Learning: A Vector-Quantization Approach to Learning Neural Network Classifiers
Comments: Accepted to AAAI2020. arXiv admin note: text overlap with arXiv:1807.10251
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [185] arXiv:2001.06485 (replaced) [pdf, ps, other]

Title: KNN active learning under local smoothness assumption
Comments: arXiv admin note: substantial text overlap with arXiv:1902.03055
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
 [186] arXiv:2001.08950 (replaced) [pdf, other]

Title: PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination
Authors: Saurabh Goyal, Anamitra R. Choudhury, Saurabh M. Raje, Venkatesan T. Chakaravarthy, Yogish Sabharwal, Ashish Verma
Comments: 11 pages, 8 figures, 4 tables
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
 [187] arXiv:2002.01328 (replaced) [pdf, other]

Title: TRAP: A Predictive Framework for Trail Running Assessment of Performance
Subjects: Applications (stat.AP)
 [188] arXiv:2002.03328 (replaced) [pdf, other]

Title: Out-of-Distribution Detection with Distance Guarantee in Deep Generative Models
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [189] arXiv:2002.04108 (replaced) [pdf, other]

Title: Adversarial Filters of Dataset Biases
Authors: Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, Yejin Choi
Comments: Accepted to ICML 2020
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
 [190] arXiv:2002.04518 (replaced) [pdf, other]

Title: Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [191] arXiv:2002.04788 (replaced) [pdf, other]

Title: To Split or Not to Split: The Impact of Disparate Treatment in Classification
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Information Theory (cs.IT); Machine Learning (stat.ML)
 [192] arXiv:2002.05551 (replaced) [pdf, other]

Title: PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [193] arXiv:2002.06836 (replaced) [pdf, other]

Title: Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning
Journal-ref: Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [194] arXiv:2002.07598 (replaced) [pdf, ps, other]

Title: A confidence interval robust to publication bias for random-effects meta-analysis of few studies
Subjects: Methodology (stat.ME)
 [195] arXiv:2002.07772 (replaced) [pdf, other]

Title: The Tree Ensemble Layer: Differentiability meets Conditional Computation
Comments: ICML 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [196] arXiv:2002.07836 (replaced) [pdf, ps, other]

Title: Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning
Comments: 40 pages
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [197] arXiv:2002.08958 (replaced) [pdf, other]

Title: Uncertainty Principle for Communication Compression in Distributed and Federated Learning and the Search for an Optimal Compressor
Comments: 22 pages, 6 figures, 2 tables
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [198] arXiv:2002.11151 (replaced) [pdf, other]

Title: TxSim: Modeling Training of Deep Neural Networks on Resistive Crossbar Systems
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
 [199] arXiv:2002.11651 (replaced) [pdf, other]

Title: Fair Learning with Private Demographic Data
Comments: ICML 2020
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
 [200] arXiv:2002.11815 (replaced) [pdf, other]

Title: Uncertainty Quantification for Sparse Deep Learning
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
 [201] arXiv:2003.00295 (replaced) [pdf, other]

Title: Adaptive Federated Optimization
Authors: Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, H. Brendan McMahan
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [202] arXiv:2003.02460 (replaced) [pdf, other]

Title: A Closer Look at Accuracy vs. Robustness
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
 [203] arXiv:2003.03241 (replaced) [pdf, other]

Title: Automated detection of corrosion in used nuclear fuel dry storage canisters using residual neural networks
Authors: Theodore Papamarkou, Hayley Guy, Bryce Kroencke, Jordan Miller, Preston Robinette, Daniel Schultz, Jacob Hinkle, Laura Pullum, Catherine Schuman, Jeremy Renshaw, Stylianos Chatzidakis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Machine Learning (stat.ML)
 [204] arXiv:2003.03919 (replaced) [pdf, other]

Title: Temporal Attribute Prediction via Joint Modeling of Multi-Relational Structure Evolution
Comments: In Proceedings of IJCAI 2020. Code can be found at this https URL. The sole copyright holder is IJCAI (International Joint Conferences on Artificial Intelligence), all rights reserved. Original Publication available at this https URL
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [205] arXiv:2003.07070 (replaced) [pdf, other]

Title: Merge-split Markov chain Monte Carlo for community detection
Authors: Tiago P. Peixoto
Comments: 13 pages, 6 figures. Code available at this https URL
Journal-ref: Phys. Rev. E 102, 012305 (2020)
Subjects: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
 [206] arXiv:2003.11194 (replaced) [pdf, ps, other]

Title: A Poisson Kalman filter for disease surveillance
Comments: 19 Pages, 8 Figures
Subjects: Methodology (stat.ME); Quantitative Methods (q-bio.QM)
 [207] arXiv:2003.11542 (replaced) [pdf, other]

Title: Partial least squares for sparsely observed curves with measurement errors
Comments: 42 pages and 3 figures
Subjects: Methodology (stat.ME)
 [208] arXiv:2003.11941 (replaced) [pdf, other]

Title: Validation Set Evaluation can be Wrong: An Evaluator-Generator Approach for Maximizing Online Performance of Ranking in E-commerce
Authors: Guangda Huzhang, ZhenJia Pang, Yongqing Gao, Yawen Liu, Weijie Shen, WenJi Zhou, Qing Da, AnXiang Zeng, Yang Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [209] arXiv:2003.12699 (replaced) [pdf, ps, other]

Title: Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits under Realizability
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
 [210] arXiv:2004.00353 (replaced) [pdf, other]

Title: SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models
Authors: Yucen Luo, Alex Beatson, Mohammad Norouzi, Jun Zhu, David Duvenaud, Ryan P. Adams, Ricky T. Q. Chen
Comments: ICLR 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [211] arXiv:2004.03391 (replaced) [pdf, other]

Title: Exploiting context dependence for image compression with upsampling
Authors: Jarek Duda
Comments: 6 pages, 4 figures
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Multimedia (cs.MM); Machine Learning (stat.ML)
 [212] arXiv:2004.05912 (replaced) [pdf, other]

Title: Towards GANs' Approximation Ability
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [213] arXiv:2004.05944 (replaced) [pdf, ps, other]

Title: Exact recovery and sharp thresholds of Stochastic Ising Block Model
Authors: Min Ye
Comments: Corrected some typos. Submitted to IEEE Transactions on Information Theory
Subjects: Probability (math.PR); Information Theory (cs.IT); Machine Learning (stat.ML)
 [214] arXiv:2004.06448 (replaced) [pdf, other]

Title: Measurement Error in Nutritional Epidemiology: A Survey
Authors: Huimin Peng
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [215] arXiv:2004.06633 (replaced) [pdf, other]

Title: Occupant Plug-load Management for Demand Response in Commercial Buildings: Field Experimentation and Statistical Characterization
Comments: 20 pages, 15 figures, 4 tables, preprint
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [216] arXiv:2004.08919 (replaced) [pdf, other]

Title: DeepPurpose: a Deep Learning Library for Drug-Target Interaction Prediction and Applications to Repurposing and Screening
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
 [217] arXiv:2004.10181 (replaced) [pdf, other]

Title: Knowing what you know: valid and validated confidence sets in multiclass and multilabel prediction
Comments: Updated section on multilabel settings addressing the cases when labels may repel each other
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
 [218] arXiv:2005.01814 (replaced) [pdf, other]

Title: Cross-validation based adaptive sampling for Gaussian process models
Subjects: Computation (stat.CO)
 [219] arXiv:2005.02532 (replaced) [src]

Title: Statistical errors in Monte Carlo-based inference for random elements
Authors: Yasutaka Shimizu
Comments: We need to change the discussion drastically
Subjects: Statistics Theory (math.ST)
 [220] arXiv:2005.02979 (replaced) [pdf, ps, other]

Title: A Survey of Algorithms for Black-Box Safety Validation
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Machine Learning (stat.ML)
 [221] arXiv:2005.03899 (replaced) [pdf, other]

Title: Amortized Bayesian Inference for Models of Cognition
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [222] arXiv:2005.05080 (replaced) [pdf, other]

Title: Continual Learning Using Multi-view Task Conditional Neural Networks
Comments: 10 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [223] arXiv:2005.10779 (replaced) [pdf, other]

Title: Using the "Hidden" Genome to Improve Classification of Cancer Types
Comments: 24 pages, 4 figures, 2 tables
Subjects: Methodology (stat.ME)
 [224] arXiv:2005.11736 (replaced) [pdf, other]

Title: Efficient Intervention Design for Causal Discovery with Latents
Comments: International Conference on Machine Learning 2020
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
 [225] arXiv:2005.12620 (replaced) [pdf, other]

Title: On the Likelihood of Local Projection Models
Authors: Masahiro Tanaka
Subjects: Methodology (stat.ME)
 [226] arXiv:2006.01225 (replaced) [pdf, ps, other]

Title: Streaming Coresets for Symmetric Tensor Factorization
Comments: Accepted at ICML 2020. Included algorithm with improved update time and fixed minor bugs
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
 [227] arXiv:2006.03227 (replaced) [pdf, other]

Title: Population-Based Black-Box Optimization for Biological Sequence Design
Authors: Christof Angermueller, David Belanger, Andreea Gane, Zelda Mariet, David Dohan, Kevin Murphy, Lucy Colwell, D Sculley
Journal-ref: Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
 [228] arXiv:2006.03745 (replaced) [pdf, other]

Title: Understanding Finite-State Representations of Recurrent Policy Networks
Comments: ICML 2020 XXAI
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [229] arXiv:2006.03968 (replaced) [pdf, other]

Title: Generative Design of Hardware-aware DNNs
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [230] arXiv:2006.03980 (replaced) [pdf, other]

Title: Fast and Powerful Conditional Randomization Testing via Distillation
Comments: This paper has been merged with a parallel work arXiv:2006.08482 by Eugene Katsevich and Aaditya Ramdas
Subjects: Methodology (stat.ME)
 [231] arXiv:2006.04131 (replaced) [pdf, other]

Title: Deep Graph Contrastive Representation Learning
Comments: Work in progress; updated experiments
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [232] arXiv:2006.04588 (replaced) [pdf, ps, other]

Title: EDCompress: Energy-Aware Model Compression for Dataflows
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [233] arXiv:2006.05301 (replaced) [pdf, other]

Title: VAEs in the Presence of Missing Data
Comments: Accepted to ICML Workshop on the Art of Learning with Missing Values (Artemiss), 17 July 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [234] arXiv:2006.06599 (replaced) [pdf, other]

Title: Robust model training and generalisation with Studentising flows
Comments: 9 pages, 8 figures, accepted for publication at INNF+ 2020 (Second ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [235] arXiv:2006.07314 (replaced) [pdf, other]

Title: Zeroth-order Deterministic Policy Gradient
Comments: 18 pages, 5 figures. Fixed some minor oversights in the theoretical development present in the previous version of the manuscript and significantly revised and expanded the simulations sections, both in the main body and supplementary material
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [236] arXiv:2006.08482 (replaced) [src]

Title: The leave-one-covariate-out conditional randomization test
Comments: This paper has been withdrawn by the authors, because it has now been merged with (and superseded by) a parallel work arXiv:2006.03980 by Molei Liu and Lucas Janson
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
 [237] arXiv:2006.08684 (replaced) [pdf, other]

Title: Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY); Machine Learning (stat.ML)
 [238] arXiv:2006.09396 (replaced) [pdf, other]

Title: Density Deconvolution with Normalizing Flows
Comments: Appearing at the second workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models (ICML 2020), Virtual Conference. 8 pages, 6 figures, 5 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [239] arXiv:2006.09635 (replaced) [pdf, other]

Title: Solving Constrained CASH Problems with ADMM
Authors: Parikshit Ram, Sijia Liu, Deepak Vijaykeerthi, Dakuo Wang, Djallel Bouneffouf, Greg Bramble, Horst Samulowitz, Alexander G. Gray
Comments: 7th ICML Workshop on Automated Machine Learning (2020)
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [240] arXiv:2006.13975 (replaced) [pdf, other]

Title: Estimation and Comparison of Correlation-based Measures of Concordance
Comments: 35 pages, 1 figure
Subjects: Statistics Theory (math.ST)
 [241] arXiv:2006.14217 (replaced) [pdf, other]

Title: Stratified stochastic variational inference for high-dimensional network factor model
Comments: 25 pages, 1 figure. Corrected compilation issues and minor typos
Subjects: Computation (stat.CO); Methodology (stat.ME)
 [242] arXiv:2006.14937 (replaced) [pdf, ps, other]

Title: Joints in Random Forests
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [243] arXiv:2006.15107 (replaced) [pdf, other]

Title: Building powerful and equivariant graph neural networks with structural message-passing
Comments: Submitted to Neurips 2020. 18 pages, 5 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [244] arXiv:2006.15785 (replaced) [pdf, other]

Title: A No-Free-Lunch Theorem for Multi-Task Learning
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
 [245] arXiv:2006.15799 (replaced) [src]

Title: Cluster-Based Partitioning of Convolutional Neural Networks, A Solution for Computational Energy and Complexity Reduction
Authors: Ali Mirzaeian, Masoud Pourreza, Mohammad Sabokrou, Ashkan Vakil, Tinoosh Mohsenin, Houman Homayoun, Avesta Sasan
Comments: paper need to be majorly revised
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [246] arXiv:2006.16193 (replaced) [pdf, other]

Title: Spectral Gap of Replica Exchange Langevin Diffusion on Mixture Distributions
Subjects: Probability (math.PR); Statistics Theory (math.ST)
 [247] arXiv:2007.01231 (replaced) [pdf, other]

Title: Software Engineering Event Modeling using Relative Time in Temporal Knowledge Graphs
Comments: 11 pages, 1 figure. 37th International Conference on Machine Learning (ICML 2020) - Workshop on Graph Representation Learning and Beyond
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE); Machine Learning (stat.ML)
 [248] arXiv:2007.01285 (replaced) [pdf, other]

Title: Deep Learning for Neuroimaging-based Diagnosis and Rehabilitation of Autism Spectrum Disorder: A Review
Authors: Marjane Khodatars, Afshin Shoeibi, Navid Ghassemi, Mahboobeh Jafari, Ali Khadem, Delaram Sadeghi, Parisa Moridian, Sadiq Hussain, Roohallah Alizadehsani, Assef Zare, Abbas Khosravi, Saeid Nahavandi, U. Rajendra Acharya, Michael Berk, Dipti Srinivasan
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
 [249] arXiv:2007.01888 (replaced) [pdf, other]

Title: Inference on the change point in high dimensional time series models via plug in least squares
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
 [250] arXiv:2007.03016 (replaced) [pdf]

Title: Multiple Imputation with Massive Data: an Application to the Panel Study of Income Dynamics
Authors: Yajuan Si, Steve Heeringa, David Johnson, Roderick Little, Wenshuo Liu, Fabian Pfeffer, Raghunathan Trivellore
Subjects: Methodology (stat.ME); Applications (stat.AP)
 [251] arXiv:2007.03383 (replaced) [pdf, other]

Title: RGCF: Refined Graph Convolution Collaborative Filtering with concise and expressive embedding
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
 [252] arXiv:2007.04439 (replaced) [pdf, other]

Title: Combining Differentiable PDE Solvers and Graph Neural Networks for Fluid Flow Prediction
Comments: ICML 2020
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
 [253] arXiv:2007.04728 (replaced) [pdf, other]

Title: Let the Data Choose its Features: Differentiable Unsupervised Feature Selection
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [254] arXiv:2007.05305 (replaced) [pdf, other]

Title: ExpertNet: Adversarial Learning and Recovery Against Noisy Labels
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [255] arXiv:2007.05424 (replaced) [pdf, other]

Title: High heritability does not imply accurate prediction under the small additive effects hypothesis
Authors: Arthur Frouin (1), Claire Dandine-Roulland (1), Morgane Pierre-Jean (1), Jean-François Deleuze (1), Christophe Ambroise (2), Edith Le Floch (1) ((1) CNRGH, Institut Jacob, CEA - Université Paris-Saclay, (2) LaMME, Université Paris-Saclay, CNRS, Université d'Évry val d'Essonne)
Subjects: Methodology (stat.ME); Genomics (q-bio.GN)