Machine Learning
New submissions
New submissions for Thu, 20 Feb 20
 [1] arXiv:2002.07836 [pdf, other]

Title: Multi-Step Model-Agnostic Meta-Learning: Convergence and Improved Algorithms
Comments: 67 pages, 8 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
As a popular meta-learning approach, the model-agnostic meta-learning (MAML) algorithm has been widely used due to its simplicity and effectiveness. However, the convergence of the general multi-step MAML still remains unexplored. In this paper, we develop a new theoretical framework, under which we characterize the convergence rate and the computational complexity of multi-step MAML. Our results indicate that although the estimation bias and variance of the stochastic meta gradient involve exponential factors of $N$ (the number of the inner-stage gradient updates), MAML still attains convergence, with complexity increasing only linearly with $N$ under a properly chosen inner stepsize. We then take a further step to develop a more efficient Hessian-free MAML. We first show that the existing zeroth-order Hessian estimator contains a constant-level estimation error, so that the MAML algorithm can perform unstably. To address this issue, we propose a novel Hessian estimator via a gradient-based Gaussian smoothing method, and show that it achieves a much smaller estimation bias and variance, and that the resulting algorithm achieves the same performance guarantee as the original MAML under mild conditions. Our experiments validate our theory and demonstrate the effectiveness of the proposed Hessian estimator.
 [2] arXiv:2002.07839 [pdf, other]

Title: Is Local SGD Better than Minibatch SGD?
Authors: Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro
Comments: 29 pages
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.
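The two methods compared in this abstract can be sketched on a toy problem. The following is a minimal illustration, assuming a one-dimensional objective f(x) = 0.5 * x**2 with additive Gaussian gradient noise; the function names and the toy objective are assumptions made here, not taken from the paper. Both routines use the same number of communication rounds and the same number of stochastic gradients per round.

```python
import random

def noisy_grad(x, rng, noise):
    # Stochastic gradient of the toy objective f(x) = 0.5 * x**2 (assumed here).
    return x + noise * rng.gauss(0.0, 1.0)

def minibatch_sgd(x, machines, rounds, local_steps, lr, rng, noise=1.0):
    # One update per round, averaging machines * local_steps gradients
    # that are all evaluated at the same shared iterate.
    batch = machines * local_steps
    for _ in range(rounds):
        g = sum(noisy_grad(x, rng, noise) for _ in range(batch)) / batch
        x = x - lr * g
    return x

def local_sgd(x, machines, rounds, local_steps, lr, rng, noise=1.0):
    # Each machine runs local_steps SGD steps from the shared iterate,
    # then the machines' final iterates are averaged (one communication).
    for _ in range(rounds):
        finals = []
        for _ in range(machines):
            y = x
            for _ in range(local_steps):
                y = y - lr * noisy_grad(y, rng, noise)
            finals.append(y)
        x = sum(finals) / machines
    return x

if __name__ == "__main__":
    rng = random.Random(0)
    print(minibatch_sgd(1.0, 4, 5, 4, 0.1, rng))
    print(local_sgd(1.0, 4, 5, 4, 0.1, rng))
```

With the noise switched off, minibatch SGD contracts the iterate by a factor (1 - lr) per round, while local SGD contracts it by (1 - lr) ** local_steps per round; the noisy case is where the trade-off studied in the abstract arises.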
 [3] arXiv:2002.07863 [pdf, other]

Title: Learning Similarity Metrics for Numerical Simulations
Comments: Main paper: 8 pages, Appendix: 20 pages. Further information at this https URL
Subjects: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Fluid Dynamics (physics.flu-dyn); Machine Learning (stat.ML)
We propose a neural network-based approach that computes a stable and generalizing metric (LSiM) to compare field data from a variety of numerical simulation sources. Our method employs a Siamese network architecture that is motivated by the mathematical properties of a metric. We leverage a controllable data generation setup with partial differential equation (PDE) solvers to create increasingly different outputs from a reference simulation in a controlled environment. A central component of our learned metric is a specialized loss function that introduces knowledge about the correlation between single data samples into the training process. To demonstrate that the proposed approach outperforms existing simple metrics for vector spaces and other learned, image-based metrics, we evaluate the different methods on a large range of test data. Additionally, we analyze benefits for generalization and the impact of an adjustable training data difficulty. The robustness of LSiM is demonstrated via an evaluation on three real-world data sets.
 [4] arXiv:2002.07867 [pdf, other]

Title: Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
A recent line of research has provided convergence guarantees for gradient descent algorithms in the excessive overparameterization regime where the widths of all the hidden layers are required to be polynomially large in the number of training samples. However, the widths of practical deep networks are often only large in the first layer(s) and then start to decrease towards the output layer. This raises an interesting open question of whether similar results also hold under this empirically relevant setting. Existing theoretical insights suggest that the loss surface of this class of networks is well-behaved, but these results usually do not provide direct algorithmic guarantees for optimization. In this paper, we close the gap by showing that one wide layer followed by pyramidal deep network topology suffices for gradient descent to find a global minimum with a geometric rate. Our proof is based on a weak form of the Polyak-Lojasiewicz inequality which holds for deep pyramidal networks in the manifold of full-rank weight matrices.
 [5] arXiv:2002.07891 [pdf, other]

Title: Towards Query-Efficient Black-Box Adversary with Zeroth-Order Natural Gradient Descent
Comments: accepted by AAAI 2020
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Despite the great achievements of modern deep neural networks (DNNs), the vulnerability of state-of-the-art DNNs raises security concerns in many application domains requiring high reliability. Various adversarial attacks have been proposed to sabotage the learning performance of DNN models. Among those, black-box adversarial attack methods have received special attention owing to their practicality and simplicity. Black-box attacks usually prefer fewer queries in order to remain stealthy and keep costs low. However, most current black-box attack methods adopt first-order gradient descent, which may come with certain deficiencies such as relatively slow convergence and high sensitivity to hyperparameter settings. In this paper, we propose a zeroth-order natural gradient descent (ZO-NGD) method to design adversarial attacks, which incorporates the zeroth-order gradient estimation technique catering to the black-box attack scenario and the second-order natural gradient descent to achieve higher query efficiency. The empirical evaluations on image classification datasets demonstrate that ZO-NGD can obtain significantly lower model query complexities compared with state-of-the-art attack methods.
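Zeroth-order gradient estimation, as mentioned in this abstract, approximates a gradient using only function evaluations (queries). A minimal sketch follows, assuming a coordinate-wise central-difference estimator; this is one common zeroth-order scheme and not necessarily the estimator used in the paper, and the function name and step size `h` are illustrative choices.

```python
def zeroth_order_gradient(f, x, h=1e-4):
    # Estimate the gradient of f at x from function values only.
    # Uses 2 * len(x) queries to f (central differences per coordinate).
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2.0 * h))
    return grad

if __name__ == "__main__":
    # Toy black-box objective (assumed for illustration): f(v) = sum of squares.
    f = lambda v: sum(t * t for t in v)
    print(zeroth_order_gradient(f, [1.0, -2.0, 0.5]))
```

The query count grows linearly with the input dimension, which is why query efficiency matters for black-box attacks on high-dimensional inputs such as images.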
 [6] arXiv:2002.07898 [pdf, other]

Title: Deep Transform and Metric Learning Network: Wedding Deep Dictionary Learning and Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
On account of its many successes in inference tasks and denoising applications, Dictionary Learning (DL) and its related sparse optimization problems have garnered a lot of research interest. While most solutions have focused on single-layer dictionaries, the recently proposed, improved Deep DL (DDL) methods have also fallen short on a number of issues. We propose herein a novel DDL approach where each DL layer can be formulated as a combination of one linear layer and a Recurrent Neural Network (RNN). The RNN is shown to flexibly account for the layer-associated and learned metric. Our proposed work unveils new insights into Neural Networks and DDL and provides a new, efficient and competitive approach to jointly learn a deep transform and a metric for inference applications. Extensive experiments are carried out to demonstrate that the proposed method can outperform not only existing DDL but also state-of-the-art generic CNNs.
 [7] arXiv:2002.07905 [pdf, other]

Title: Empirical Policy Evaluation with Supergraphs
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We devise and analyze algorithms for the empirical policy evaluation problem in reinforcement learning. Our algorithms explore backward from high-cost states to find high-value ones, in contrast to forward approaches that work forward from all states. While several papers have demonstrated the utility of backward exploration empirically, we conduct rigorous analyses which show that our algorithms can reduce average-case sample complexity from $O(S \log S)$ to as low as $O(\log S)$.
 [8] arXiv:2002.07906 [pdf, other]

Title: CAUSE: Learning Granger Causality from Event Sequences using Attribution Methods
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study the problem of learning Granger causality between event types from asynchronous, interdependent, multi-type event sequences. Existing work suffers from either limited model flexibility or poor model explainability and thus fails to uncover Granger causality across a wide variety of event sequences with diverse event interdependency. To address these weaknesses, we propose CAUSE (Causality from AttribUtions on Sequence of Events), a novel framework for the studied task. The key idea of CAUSE is to first implicitly capture the underlying event interdependency by fitting a neural point process, and then extract from the process a Granger causality statistic using an axiomatic attribution method. Across multiple datasets riddled with diverse event interdependency, we demonstrate that CAUSE achieves superior performance on correctly inferring the inter-type Granger causality over a range of state-of-the-art methods.
 [9] arXiv:2002.07911 [pdf, other]

Title: Generating Automatic Curricula via Self-Supervised Active Domain Randomization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Goal-directed Reinforcement Learning (RL) traditionally considers an agent interacting with an environment, prescribing a real-valued reward to an agent proportional to the completion of some goal. Goal-directed RL has seen large gains in sample efficiency, due to the ease of reusing or generating new experience by proposing goals. In this work, we build on the framework of self-play, allowing an agent to interact with itself in order to make progress on some unknown task. We use Active Domain Randomization and self-play to create a novel, coupled environment-goal curriculum, where agents learn through progressively more difficult tasks and environment variations. Our method, Self-Supervised Active Domain Randomization (SS-ADR), generates a growing curriculum, encouraging the agent to try tasks that are just outside of its current capabilities, while building a domain-randomization curriculum that enables state-of-the-art results on various sim2real transfer tasks. Our results show that a curriculum of co-evolving the environment difficulty along with the difficulty of goals set in each environment provides practical benefits in the goal-directed tasks tested.
 [10] arXiv:2002.07916 [pdf, other]

Title: Information Condensing Active Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce Information Condensing Active Learning (ICAL), a batch mode model agnostic Active Learning (AL) method targeted at Deep Bayesian Active Learning that focuses on acquiring labels for points which have as much information as possible about the still unacquired points. ICAL uses the Hilbert Schmidt Independence Criterion (HSIC) to measure the strength of the dependency between a candidate batch of points and the unlabeled set. We develop key optimizations that allow us to scale our method to large unlabeled sets. We show significant improvements in terms of model accuracy and negative log likelihood (NLL) on several image datasets compared to state of the art batch mode AL methods for deep learning.
 [11] arXiv:2002.07920 [pdf, other]

Title: Block Switching: A Stochastic Approach for Deep Learning Security
Comments: Accepted by AdvML19: Workshop on Adversarial Learning Methods for Machine Learning and Data Mining at KDD, Anchorage, Alaska, USA, August 5th, 2019, 5 pages
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Recent studies of adversarial attacks have revealed the vulnerability of modern deep learning models. That is, subtly crafted perturbations of the input can make a trained network with high accuracy produce arbitrary incorrect predictions, while remaining imperceptible to the human visual system. In this paper, we introduce Block Switching (BS), a defense strategy against adversarial attacks based on stochasticity. BS replaces a block of model layers with multiple parallel channels, and the active channel is randomly assigned at run time and is hence unpredictable to the adversary. We show empirically that BS leads to a more dispersed input gradient distribution and superior defense effectiveness compared with other stochastic defenses such as stochastic activation pruning (SAP). Compared to other defenses, BS is also characterized by the following features: (i) BS causes a smaller drop in test accuracy; (ii) BS is attack-independent; and (iii) BS is compatible with other defenses and can be used jointly with them.
 [12] arXiv:2002.07922 [pdf, other]

Title: Short-Term Traffic Flow Prediction Using Variational LSTM Networks
Comments: 18 pages, 13 figures
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP)
Traffic flow characteristics are among the most critical decision-making and traffic policing factors in a region. Awareness of the predicted status of the traffic flow is of prime importance in traffic management and traffic information divisions. The purpose of this research is to propose a forecasting model for traffic flow that uses deep learning techniques based on historical data in the Intelligent Transportation Systems area. The historical data were collected from the Caltrans Performance Measurement System (PeMS) over six months in 2019. The proposed prediction model is a Variational Long Short-Term Memory Encoder (VLSTM-E for short), which tries to estimate the flow more accurately than other conventional methods. VLSTM-E can provide more reliable short-term traffic flow predictions by considering the distribution and missing values.
 [13] arXiv:2002.07933 [pdf, other]

Title: Improving Generalization by Controlling Label-Noise Information in Neural Network Weights
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In the presence of noisy or incorrect labels, neural networks have the undesirable tendency to memorize information about the noise. Standard regularization techniques such as dropout, weight decay or data augmentation sometimes help, but do not prevent this behavior. If one considers neural network weights as random variables that depend on the data and stochasticity of training, the amount of memorized information can be quantified with the Shannon mutual information between weights and the vector of all training labels given inputs, $I(w : \mathbf{y} \mid \mathbf{x})$. We show that for any training algorithm, low values of this term correspond to reduction in memorization of label noise and better generalization bounds. To obtain these low values, we propose training algorithms that employ an auxiliary network that predicts gradients in the final layers of a classifier without accessing labels. We illustrate the effectiveness of our approach on versions of MNIST, CIFAR-10, and CIFAR-100 corrupted with various noise models, and on a large-scale dataset Clothing1M that has noisy labels.
 [14] arXiv:2002.07942 [pdf, other]

Title: Source Separation with Deep Generative Priors
Comments: 18 pages, 15 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Despite substantial progress in signal source separation, results for richly structured data continue to contain perceptible artifacts. In contrast, recent deep generative models can produce authentic samples in a variety of domains that are indistinguishable from samples of the data distribution. This paper introduces a Bayesian approach to source separation that uses generative models as priors over the components of a mixture of sources, and Langevin dynamics to sample from the posterior distribution of sources given a mixture. This decouples the source separation problem from generative modeling, enabling us to directly use cutting-edge generative models as priors. The method achieves state-of-the-art performance for MNIST digit separation. We introduce new methodology for evaluating separation quality on richer datasets, providing quantitative evaluation of separation results on CIFAR-10. We also provide qualitative results on LSUN.
 [15] arXiv:2002.07948 [pdf, other]

Title: Personalized Federated Learning: A Meta-Learning Approach
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
The goal of federated learning is to design algorithms in which several agents communicate with a central node, in a privacy-protecting manner, to minimize the average of their loss functions. In this approach, each node not only shares the required computational budget but also has access to a larger data set, which improves the quality of the resulting model. However, this method only develops a common output for all the agents, and therefore does not adapt the model to each user's data. This is an important missing feature, especially given the heterogeneity of the underlying data distribution for various agents. In this paper, we study a personalized variant of federated learning in which our goal is to find a shared initial model in a distributed manner that can be slightly updated by either a current or a new user by performing one or a few steps of gradient descent with respect to its own loss function. This approach keeps all the benefits of the federated learning architecture while leading to a more personalized model for each user. We show this problem can be studied within the Model-Agnostic Meta-Learning (MAML) framework. Inspired by this connection, we propose a personalized variant of the well-known Federated Averaging algorithm and evaluate its performance in terms of gradient norm for non-convex loss functions. Further, we characterize how this performance is affected by the closeness of underlying distributions of user data, measured in terms of distribution distances such as Total Variation and 1-Wasserstein metric.
 [16] arXiv:2002.07956 [pdf, other]

Title: Curriculum in Gradient-Based Meta-Reinforcement Learning
Comments: 11 pages, 10 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Gradient-based meta-learners such as Model-Agnostic Meta-Learning (MAML) have shown strong few-shot performance in supervised and reinforcement learning settings. However, specifically in the case of meta-reinforcement learning (meta-RL), we can show that gradient-based meta-learners are sensitive to task distributions. With the wrong curriculum, agents suffer the effects of meta-overfitting, shallow adaptation, and adaptation instability. In this work, we begin by highlighting intriguing failure cases of gradient-based meta-RL and show that task distributions can wildly affect algorithmic outputs, stability, and performance. To address this problem, we leverage insights from recent literature on domain randomization and propose meta Active Domain Randomization (meta-ADR), which learns a curriculum of tasks for gradient-based meta-RL in a similar way as ADR does for sim2real transfer. We show that this approach induces more stable policies on a variety of simulated locomotion and navigation tasks. We assess in- and out-of-distribution generalization and find that the learned task distributions, even in an unstructured task space, greatly improve the adaptation performance of MAML. Finally, we motivate the need for better benchmarking in meta-RL that prioritizes generalization over single-task adaptation performance.
 [17] arXiv:2002.07962 [pdf, other]

Title: Inductive Representation Learning on Temporal Graphs
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Inductive representation learning on temporal graphs is an important step toward scalable machine learning on real-world dynamic networks. The evolving nature of temporal dynamic graphs requires handling new nodes as well as capturing temporal patterns. The node embeddings, which are now functions of time, should represent both the static node features and the evolving topological structures. Moreover, node and topological features can be temporal as well, and the node embeddings should also capture their patterns. We propose the temporal graph attention (TGAT) layer to efficiently aggregate temporal-topological neighborhood features as well as to learn the time-feature interactions. For TGAT, we use the self-attention mechanism as a building block and develop a novel functional time encoding technique based on the classical Bochner's theorem from harmonic analysis. By stacking TGAT layers, the network recognizes the node embeddings as functions of time and is able to inductively infer embeddings for both new and observed nodes as the graph evolves. The proposed approach handles both node classification and link prediction tasks, and can be naturally extended to include the temporal edge features. We evaluate our method with transductive and inductive tasks under temporal settings with two benchmark datasets and one industrial dataset. Our TGAT model compares favorably to state-of-the-art baselines as well as the previous temporal graph embedding approaches.
 [18] arXiv:2002.07965 [pdf, other]

Title: Being Bayesian about Categorical Probability
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Neural networks utilize the softmax as a building block in classification tasks, which suffers from an overconfidence problem and lacks the ability to represent uncertainty. As a Bayesian alternative to the softmax, we consider a random variable of a categorical probability over class labels. In this framework, the prior distribution explicitly models the presumed noise inherent in the observed label, which provides consistent gains in generalization performance in multiple challenging tasks. The proposed method inherits advantages of Bayesian approaches that achieve better uncertainty estimation and model calibration. Our method can be implemented as a plug-and-play loss function with negligible computational overhead compared to the softmax with the cross-entropy loss function.
 [19] arXiv:2002.07971 [pdf, other]

Title: Gradient Boosting Neural Networks: GrowNet
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
A novel gradient boosting framework is proposed where shallow neural networks are employed as "weak learners". General loss functions are considered under this unified framework, with specific examples presented for classification, regression and learning to rank. A fully corrective step is incorporated to remedy the pitfall of greedy function approximation in classic gradient boosting decision trees. The proposed model rendered state-of-the-art results in all three tasks on multiple datasets. An ablation study is performed to shed light on the effect of each model component and of the model hyperparameters.
 [20] arXiv:2002.07994 [pdf, other]

Title: Best-item Learning in Random Utility Models with Subset Choices
Comments: Accepted to 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We consider the problem of PAC learning the most valuable item from a pool of $n$ items using sequential, adaptively chosen plays of subsets of $k$ items, when, upon playing a subset, the learner receives relative feedback sampled according to a general Random Utility Model (RUM) with independent noise perturbations to the latent item utilities. We identify a new property of such a RUM, termed the minimum advantage, that helps in characterizing the complexity of separating pairs of items based on their relative win/loss empirical counts, and can be bounded as a function of the noise distribution alone. We give a learning algorithm for general RUMs, based on pairwise relative counts of items and hierarchical elimination, along with a new PAC sample complexity guarantee of $O(\frac{n}{c^2\epsilon^2} \log \frac{k}{\delta})$ rounds to identify an $\epsilon$-optimal item with confidence $1-\delta$, when the worst case pairwise advantage in the RUM has sensitivity at least $c$ to the parameter gaps of items. Fundamental lower bounds on PAC sample complexity show that this is near-optimal in terms of its dependence on $n$, $k$ and $c$.
 [21] arXiv:2002.08000 [pdf, other]

Title: Action-Manipulation Attacks Against Stochastic Bandits: Attacks and Defense
Comments: 13 pages, 7 figures, submitted to IEEE Transaction on Signal Processing
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Optimization and Control (math.OC); Machine Learning (stat.ML)
Due to the broad range of applications of the stochastic multi-armed bandit model, understanding the effects of adversarial attacks and designing bandit algorithms robust to attacks are essential for the safe application of this model. In this paper, we introduce a new class of attack named the action-manipulation attack. In this attack, an adversary can change the action signal selected by the user. We show that without knowledge of the mean rewards of arms, our proposed attack can manipulate the Upper Confidence Bound (UCB) algorithm, a widely used bandit algorithm, into pulling a target arm very frequently by spending only logarithmic cost. To defend against this class of attacks, we introduce a novel algorithm that is robust to action-manipulation attacks when an upper bound for the total attack cost is given. We prove that our algorithm has a pseudo-regret upper bounded by $\mathcal{O}(\max\{\log T,A\})$, where $T$ is the total number of rounds and $A$ is the upper bound of the total attack cost.
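For readers unfamiliar with the UCB algorithm targeted in this abstract, the following is a minimal sketch of the standard UCB1 index rule without any attack or defense. The exploration constant, the function names, and the deterministic toy rewards are assumptions made here for illustration; the paper's exact UCB variant may differ.

```python
import math

def ucb1(pull, n_arms, horizon):
    # pull(arm) returns a reward; counts[a] is how often arm a was pulled.
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialize the estimates
        else:
            # Choose the arm with the largest empirical mean plus confidence bonus.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts

if __name__ == "__main__":
    means = [0.2, 0.8]  # hypothetical arm rewards, noise-free for clarity
    print(ucb1(lambda a: means[a], n_arms=2, horizon=500))
```

An action-manipulation adversary, as described in the abstract, would sit between the arm chosen inside the loop and the arm actually passed to `pull`; that hook is deliberately left out of this sketch.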
 [22] arXiv:2002.08012 [pdf, other]

Title: Indirect Adversarial Attacks via Poisoning Neighbors for Graph Convolutional Networks
Authors: Tsubasa Takahashi
Comments: Accepted in IEEE BigData 2019
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Graph convolutional neural networks, which learn aggregations over neighbor nodes, have achieved great performance in node classification tasks. However, recent studies reported that such graph convolutional node classifiers can be deceived by adversarial perturbations on graphs. By abusing graph convolutions, a node's classification result can be influenced by poisoning its neighbors. Given an attributed graph and a node classifier, how can we evaluate robustness against such indirect adversarial attacks? Can we generate strong adversarial perturbations that are effective not only on one-hop neighbors but also on nodes farther from the target? In this paper, we demonstrate that the node classifier can be deceived with high confidence by poisoning just a single node, even one that is two or more hops away from the target. To achieve the attack, we propose a new approach that searches for smaller perturbations on just a single node far from the target. In our experiments, our proposed method shows a 99% attack success rate within two hops of the target on two datasets. We also demonstrate that m-layer graph convolutional neural networks can be deceived by our indirect attack within m-hop neighbors. The proposed attack can be used as a benchmark in future defense attempts to develop graph convolutional neural networks with adversarial robustness.
 [23] arXiv:2002.08032 [pdf, other]

Title: A Fixed point view: A Model-Based Clustering Framework
Comments: 10 pages, 2 figures
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
As data volumes grow, clustering analysis, a branch of unsupervised learning, still lacks a unified understanding and application of its underlying mathematical principles. Taking a fixed-point view, this paper restates model-based clustering and proposes a unified clustering framework. To find fixed points as cluster centers, the framework iteratively constructs a contraction map, which reveals the convergence mechanism and the interconnections among algorithms. By specifying a contraction map, the Gaussian mixture model (GMM) can be mapped to the framework as an application. We hope the fixed-point framework will help the design of future clustering algorithms.
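The idea of treating cluster centers as fixed points of an update map can be illustrated with a short sketch. Lloyd's k-means step on one-dimensional data is used here purely as a familiar stand-in; the abstract does not specify its contraction map, Lloyd's step is not guaranteed to be a contraction, and both function names are hypothetical.

```python
def lloyd_step(points, centers):
    # One update of the map: assign each point to its nearest center,
    # then move each center to the mean of its assigned points.
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
        clusters[nearest].append(p)
    return [
        sum(c) / len(c) if c else centers[j]  # keep a center with no points
        for j, c in enumerate(clusters)
    ]

def fixed_point(update, state, tol=1e-12, max_iter=100):
    # Iterate state <- update(state) until the state stops changing.
    for _ in range(max_iter):
        new_state = update(state)
        if max(abs(a - b) for a, b in zip(new_state, state)) <= tol:
            return new_state
        state = new_state
    return state

if __name__ == "__main__":
    points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
    print(fixed_point(lambda c: lloyd_step(points, c), [0.0, 12.0]))
```

The returned centers satisfy `lloyd_step(points, centers) == centers`, which is the fixed-point condition the abstract refers to in general terms.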
 [24] arXiv:2002.08037 [pdf, other]

Title: Efficient Deep Reinforcement Learning through Policy Transfer
Authors: Tianpei Yang, Jianye Hao, Zhaopeng Meng, Zongzhang Zhang, Weixun Wang, Yujing Hu, Yingfeng Cheng, Changjie Fan, Zhaodong Wang, Jiajie Peng
Comments: Accepted by AAMAS'2020 as an EXTENDED ABSTRACT
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Transfer Learning (TL) has shown great potential to accelerate Reinforcement Learning (RL) by leveraging prior knowledge from past learned policies of relevant tasks. Existing transfer approaches either explicitly compute the similarity between tasks or select appropriate source policies to provide guided explorations for the target task. However, a way to directly optimize the target policy by alternatively utilizing knowledge from appropriate source policies, without explicitly measuring the similarity, is currently missing. In this paper, we propose a novel Policy Transfer Framework (PTF) to accelerate RL by taking advantage of this idea. Our framework learns when and which source policy is the best to reuse for the target policy and when to terminate it by modeling multi-policy transfer as the option learning problem. PTF can be easily combined with existing deep RL approaches. Experimental results show it significantly accelerates the learning process and surpasses state-of-the-art policy transfer methods in terms of learning efficiency and final performance in both discrete and continuous action spaces.
 [25] arXiv:2002.08041 [pdf, other]

Title: Enlarging Discriminative Power by Adding an Extra Class in Unsupervised Domain Adaptation
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
In this paper, we study the problem of unsupervised domain adaptation, which aims at obtaining a prediction model for the target domain using labeled data from the source domain and unlabeled data from the target domain. There exists an array of recent research based on the idea of extracting features that are not only invariant for both domains but also provide high discriminative power for the target domain. In this paper, we propose an idea for strengthening this discriminative power: adding a new, artificial class and training the model on the data together with the GAN-generated samples of the new class. The trained model based on the new class samples is capable of extracting features that are more discriminative by repositioning data of current classes in the target domain and therefore drawing the decision boundaries more effectively. Our idea is highly generic, so it is compatible with many existing methods such as DANN, VADA, and DIRT-T. We conduct various experiments on the standard datasets commonly used for the evaluation of unsupervised domain adaptation and demonstrate that our algorithm achieves SOTA performance in many scenarios.
 [26] arXiv:2002.08053 [pdf, other]

Title: Progressive Identification of True Labels for Partial-Label Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Partiallabel learning is one of the important weakly supervised learning problems, where each training example is equipped with a set of candidate labels that contains the true label. Most existing methods elaborately designed learning objectives as constrained optimizations that must be solved in specific manners, making their computational complexity a bottleneck for scaling up to big data. The goal of this paper is to propose a novel framework of partiallabel learning without implicit assumptions on the model or optimization algorithm. More specifically, we propose a general estimator of the classification risk, theoretically analyze the classifierconsistency, and establish an estimation error bound. We then explore a progressive identification method for approximately minimizing the proposed risk estimator, where the update of the model and identification of true labels are conducted in a seamless manner. The resulting algorithm is modelindependent and lossindependent, and compatible with stochastic optimization. Thorough experiments demonstrate it sets the new state of the art.
 [27] arXiv:2002.08056 [pdf, other]

Title: The Geometry of Sign Gradient Descent
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Sign-based optimization methods have become popular in machine learning due to their favorable communication cost in distributed optimization and their surprisingly good performance in neural network training. Furthermore, they are closely connected to so-called adaptive gradient methods like Adam. Recent works on signSGD have used a non-standard "separable smoothness" assumption, whereas some older works study sign gradient descent as steepest descent with respect to the $\ell_\infty$-norm. In this work, we unify these existing results by showing a close connection between separable smoothness and $\ell_\infty$-smoothness and argue that the latter is the weaker and more natural assumption. We then proceed to study the smoothness constant with respect to the $\ell_\infty$-norm and thereby isolate geometric properties of the objective function which affect the performance of sign-based methods. In short, we find sign-based methods to be preferable over gradient descent if (i) the Hessian is to some degree concentrated on its diagonal, and (ii) its maximal eigenvalue is much larger than the average eigenvalue. Both properties are common in deep networks.
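As a rough illustration of the update this abstract refers to, the sketch below runs plain sign gradient descent, x ← x − η·sign(∇f(x)), on a toy quadratic whose Hessian is diagonal with one eigenvalue much larger than the average. The objective, the step-size schedule, and all function names are assumptions made for this example; they are not taken from the paper.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def objective(x, diag):
    """Toy quadratic f(x) = 0.5 * sum_i d_i * x_i^2 with diagonal Hessian diag."""
    return 0.5 * sum(d * xi * xi for d, xi in zip(diag, x))


def gradient(x, diag):
    """Gradient of the toy quadratic: d_i * x_i in each coordinate."""
    return [d * xi for d, xi in zip(diag, x)]


def sign_gradient_descent(x0, diag, step=0.5, decay=0.9, iters=100):
    """Sign gradient descent with a geometrically decaying step size (assumed schedule)."""
    x = list(x0)
    for _ in range(iters):
        g = gradient(x, diag)
        # Each coordinate moves by the same amount; only the gradient sign is used.
        x = [xi - step * sign(gi) for xi, gi in zip(x, g)]
        step *= decay
    return x


# Diagonal Hessian whose largest eigenvalue (100) far exceeds the average (25.75).
diag = [100.0, 1.0, 1.0, 1.0]
x_start = [1.0, 1.0, 1.0, 1.0]
x_final = sign_gradient_descent(x_start, diag)
f_start = objective(x_start, diag)
f_final = objective(x_final, diag)
```

Because the update ignores gradient magnitudes, the badly scaled first coordinate shrinks at the same rate as the others in this toy problem; this is only a sanity check, not a reproduction of the paper's analysis.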
 [28] arXiv:2002.08071 [pdf, other]

Title: Dissecting Neural ODEs
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Continuous deep learning architectures have recently re-emerged as variants of Neural Ordinary Differential Equations (Neural ODEs). The infinite-depth approach offered by these models theoretically bridges the gap between deep learning and dynamical systems; however, deciphering their inner workings is still an open challenge and most of their applications are currently limited to inclusion as generic black-box modules. In this work, we "open the box" and offer a system-theoretic perspective, including state augmentation strategies and robustness, with the aim of clarifying the influence of several design choices on the underlying dynamics. We also introduce novel architectures: among them, a Galerkin-inspired depth-varying parameter model and neural ODEs with data-controlled vector fields.
 [29] arXiv:2002.08095 [pdf, ps, other]

Title: Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently
Authors: Asaf Cassel (1), Alon Cohen (2), Tomer Koren (1) ((1) School of Computer Science, Tel Aviv University, (2) Google Research, Tel Aviv)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We consider the problem of learning in Linear Quadratic Control systems whose transition parameters are initially unknown. Recent results in this setting have demonstrated efficient learning algorithms with regret growing with the square root of the number of decision steps. We present new efficient algorithms that achieve, perhaps surprisingly, regret that scales only (poly)logarithmically with the number of steps in two scenarios: when only the state transition matrix $A$ is unknown, and when only the state-action transition matrix $B$ is unknown and the optimal policy satisfies a certain non-degeneracy condition. On the other hand, we give a lower bound that shows that when the latter condition is violated, square root regret is unavoidable.
 [30] arXiv:2002.08104 [pdf, other]

Title: Neural Networks on Random Graphs
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
We performed a massive evaluation of neural networks with architectures corresponding to random graphs of various types. Apart from the classical random graph families including random, scale-free and small-world graphs, we introduced a novel and flexible algorithm for directly generating random directed acyclic graphs (DAGs) and studied a class of graphs derived from functional resting-state fMRI networks. A majority of the best performing networks were indeed in these new families. We also proposed a general procedure for turning a graph into a DAG, as required for a feedforward neural network. We investigated various structural and numerical properties of the graphs in relation to neural network test accuracy. Since none of the classical numerical graph invariants by itself seems sufficient to single out the best networks, we introduced new numerical characteristics that selected a set of quasi-1-dimensional graphs, which were the majority among the best performing networks.
 [31] arXiv:2002.08111 [pdf, other]

Title: Hierarchical Quantized Autoencoders
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Despite progress in training neural networks for lossy image compression, current approaches fail to maintain both perceptual quality and high-level features at very low bitrates. Encouraged by recent success in learning discrete representations with Vector Quantized Variational AutoEncoders (VQ-VAEs), we motivate the use of a hierarchy of VQ-VAEs to attain high factors of compression. We show that the combination of quantization and hierarchical latent structure aids likelihood-based image compression. This leads us to introduce a more probabilistic framing of the VQ-VAE, of which previous work is a limiting case. Our hierarchy produces a Markovian series of latent variables that reconstruct high-quality images which retain semantically meaningful features. These latents can then be further used to generate realistic samples. We provide qualitative and quantitative evaluations of reconstructions and samples on the CelebA and MNIST datasets.
 [32] arXiv:2002.08118 [pdf, other]

Title: Randomized Smoothing of All Shapes and Sizes
Comments: 9 pages main text, 40 pages total
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Randomized smoothing is a recently proposed defense against adversarial attacks that has achieved state-of-the-art provable robustness against $\ell_2$ perturbations. Soon after, a number of works devised new randomized smoothing schemes for other metrics, such as $\ell_1$ or $\ell_\infty$; however, for each geometry, substantial effort was needed to derive new robustness guarantees. This raises the question: can we find a general theory for randomized smoothing?
In this work we propose a novel framework for devising and analyzing randomized smoothing schemes, and validate its effectiveness in practice. Our theoretical contributions are as follows: (1) We show that for an appropriate notion of "optimal", the optimal smoothing distributions for any "nice" norm have level sets given by the *Wulff Crystal* of that norm. (2) We propose two novel and complementary methods for deriving provably robust radii for any smoothing distribution. Finally, (3) we show fundamental limits to current randomized smoothing techniques via the theory of *Banach space cotypes*. By combining (1) and (2), we significantly improve the state-of-the-art certified accuracy in $\ell_1$ on standard datasets. On the other hand, using (3), we show that, without more information than label statistics under random input perturbations, randomized smoothing cannot achieve nontrivial certified accuracy against perturbations of $\ell_\infty$-norm $\Omega(1/\sqrt d)$, when the input dimension $d$ is large. We provide code at github.com/tonyduan/rs4a.
 [33] arXiv:2002.08125 [pdf, other]

Title: Gradient-Adjusted Neuron Activation Profiles for Comprehensive Introspection of Convolutional Speech Recognition Models
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
Deep Learning based Automatic Speech Recognition (ASR) models are very successful, but hard to interpret. To gain a better understanding of how Artificial Neural Networks (ANNs) accomplish their tasks, introspection methods have been proposed. Adapting such techniques from computer vision to speech recognition is not straightforward, because speech data is more complex and less interpretable than image data. In this work, we introduce Gradient-adjusted Neuron Activation Profiles (GradNAPs) as a means to interpret features and representations in Deep Neural Networks. GradNAPs are characteristic responses of ANNs to particular groups of inputs, which incorporate the relevance of neurons for prediction. We show how to utilize GradNAPs to gain insight into how data is processed in ANNs. This includes different ways of visualizing features and clustering of GradNAPs to compare embeddings of different groups of inputs in any layer of a given network. We demonstrate our proposed techniques using a fully-convolutional ASR model.
 [34] arXiv:2002.08165 [pdf, other]

Title: Using Hindsight to Anchor Past Knowledge in Continual Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In continual learning, the learner faces a stream of data whose distribution changes over time. Modern neural networks are known to suffer under this setting, as they quickly forget previously acquired knowledge. To address such catastrophic forgetting, many continual learning methods implement different types of experience replay, re-learning on past data stored in a small buffer known as episodic memory. In this work, we complement experience replay with a new objective that we call anchoring, where the learner uses bilevel optimization to update its knowledge on the current task, while keeping intact the predictions on some anchor points of past tasks. These anchor points are learned using gradient-based optimization to maximize forgetting, which is approximated by fine-tuning the currently trained model on the episodic memory of past tasks. Experiments on several supervised learning benchmarks for continual learning demonstrate that our approach improves the standard experience replay in terms of both accuracy and forgetting metrics and for various sizes of episodic memories.
 [35] arXiv:2002.08196 [pdf, other]

Title: Federated Learning in the Sky: Joint Power Allocation and Scheduling with UAV Swarms
Comments: 8 pages, 4 figures
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Robotics (cs.RO); Signal Processing (eess.SP); Machine Learning (stat.ML)
Unmanned aerial vehicle (UAV) swarms must exploit machine learning (ML) in order to execute various tasks ranging from coordinated trajectory planning to cooperative target recognition. However, due to the lack of continuous connections between the UAV swarm and ground base stations (BSs), using centralized ML will be challenging, particularly when dealing with a large volume of data. In this paper, a novel framework is proposed to implement distributed federated learning (FL) algorithms within a UAV swarm that consists of a leading UAV and several following UAVs. Each following UAV trains a local FL model based on its collected data and then sends this trained local model to the leading UAV, which aggregates the received models, generates a global FL model, and transmits it to the followers over the intra-swarm network. To identify how wireless factors, like fading, transmission delay, and UAV antenna angle deviations resulting from wind and mechanical vibrations, impact the performance of FL, a rigorous convergence analysis for FL is performed. Then, a joint power allocation and scheduling design is proposed to optimize the convergence rate of FL while taking into account the energy consumption during convergence and the delay requirement imposed by the swarm's control system. Simulation results validate the effectiveness of the FL convergence analysis and show that the joint design strategy can reduce the number of communication rounds needed for convergence by as much as 35% compared with the baseline design.
 [36] arXiv:2002.08204 [pdf]

Title: SYMOG: learning symmetric mixture of Gaussian modes for improved fixed-point quantization
Comments: Preprint submitted to Neurocomputing
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Deep neural networks (DNNs) have been proven to outperform classical methods on several machine learning benchmarks. However, they have high computational complexity and require powerful processing units. Especially when deployed on embedded systems, model size and inference time must be significantly reduced. We propose SYMOG (symmetric mixture of Gaussian modes), which significantly decreases the complexity of DNNs through low-bit fixed-point quantization. SYMOG is a novel soft quantization method in which the learning task and the quantization are solved simultaneously. During training the weight distribution changes from a unimodal Gaussian distribution to a symmetric mixture of Gaussians, where each mean value belongs to a particular fixed-point mode. We evaluate our approach with different architectures (LeNet5, VGG7, VGG11, DenseNet) on common benchmark data sets (MNIST, CIFAR-10, CIFAR-100) and we compare with state-of-the-art quantization approaches. We achieve excellent results and outperform 2-bit state-of-the-art performance with an error rate of only 5.71% on CIFAR-10 and 27.65% on CIFAR-100.
 [37] arXiv:2002.08224 [pdf, other]

Title: A Survey on Predictive Maintenance for Industry 4.0
Authors: Christian Krupitzer (1), Tim Wagenhals (2), Marwin Züfle (1), Veronika Lesch (1), Dominik Schäfer (3), Amin Mozaffarin (4), Janick Edinger (2), Christian Becker (2), Samuel Kounev (1) ((1) University of Würzburg, Würzburg, Germany, (2) University of Mannheim, Mannheim, Germany, (3) Syntax Systems GmbH, Weinheim, Germany, (4) MOZYS Engineering GmbH, Würzburg)
Subjects: Machine Learning (cs.LG)
Production issues at Volkswagen in 2016 led to dramatic losses in sales of up to 400 million Euros per week. This example shows the huge financial impact of a working production facility for companies. Especially in the data-driven domains of Industry 4.0 and Industrial IoT with intelligent, connected machines, a conventional, static maintenance schedule seems to be old-fashioned. In this paper, we present a survey on the current state of the art in predictive maintenance for Industry 4.0. Based on a structured literature survey, we present a classification of predictive maintenance in the context of Industry 4.0 and discuss recent developments in this area.
 [38] arXiv:2002.08243 [pdf, ps, other]

Title: Optimistic Policy Optimization with Bandit Feedback
Comments: 34 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Policy optimization methods are among the most widely used classes of Reinforcement Learning (RL) algorithms. Yet, so far, such methods have been mostly analyzed from an optimization perspective, without addressing the problem of exploration, or by making strong assumptions on the interaction with the environment. In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback. For this setting, we propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde O(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards. Furthermore, we prove $\tilde O(\sqrt{S^2 A H^4} K^{2/3})$ regret for adversarial rewards. Interestingly, this result matches previous bounds derived for the bandit feedback case, yet with known transitions. To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
 [39] arXiv:2002.08247 [pdf, other]

Title: Learning Global Transparent Models from Local Contrastive Explanations
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
There is a rich and growing literature on producing local, pointwise contrastive/counterfactual explanations for complex models. These methods highlight what is important to justify the classification and/or produce a contrast point that alters the final classification. Other works try to build globally interpretable models like decision trees and rule lists directly by efficient model search using the data or by transferring information from a complex model using distillation-like methods. Although these interpretable global models can be useful, they may not be consistent with local explanations from a specific complex model of choice. In this work, we explore the question: Can we produce a transparent global model that is consistent with/derivable from local explanations? Based on a key insight we provide a novel method where every local contrastive/counterfactual explanation can be turned into a Boolean feature. These Boolean features are sparse conjunctions of binarized features. The dataset thus constructed is consistent with local explanations by design and one can train an interpretable model like a decision tree on it. We note that this approach strictly loses information due to reliance only on sparse local explanations; nonetheless, we demonstrate empirically that in many cases it can still be competitive with respect to the complex model's performance and also other methods that learn directly from the original dataset. Our approach also provides an avenue to benchmark local explanation methods in a quantitative manner.
 [40] arXiv:2002.08258 [pdf, ps, other]

Title: Knapsack Pruning with Inner Distillation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Neural network pruning reduces the computational cost of an over-parameterized network to improve its efficiency. Popular methods range from $\ell_1$-norm sparsification to Neural Architecture Search (NAS). In this work, we propose a novel pruning method that optimizes the final accuracy of the pruned network and distills knowledge from the over-parameterized parent network's inner layers. To enable this approach, we formulate the network pruning as a Knapsack Problem which optimizes the trade-off between the importance of neurons and their associated computational cost. Then we prune the network channels while maintaining the high-level structure of the network. The pruned network is fine-tuned under the supervision of the parent network using its inner network knowledge, a technique we refer to as the Inner Knowledge Distillation. Our method leads to state-of-the-art pruning results on ImageNet, CIFAR-10 and CIFAR-100 using ResNet backbones. To prune complex network structures such as convolutions with skip-links and depth-wise convolutions, we propose a block grouping approach. Through this we produce compact architectures with the same FLOPs as EfficientNet-B0 and MobileNetV3 but with higher accuracy, by $1\%$ and $0.3\%$ respectively on ImageNet, and faster runtime on GPU.
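To make the knapsack framing above concrete, here is a minimal sketch of a textbook 0/1 knapsack solved by dynamic programming: each channel has an importance score and a compute cost, and the goal is to keep the subset with maximal total importance under a cost budget. The integer scores, costs, budget, and the function name are hypothetical, and the paper's actual formulation and solver may differ.

```python
def knapsack_select(importance, cost, budget):
    """Pick items maximizing total importance with total cost <= budget.

    importance and cost are lists of non-negative integers (hypothetical
    per-channel scores and compute costs); budget is an integer cost limit.
    Returns (best_total_importance, sorted_indices_of_kept_items).
    """
    n = len(importance)
    # best[i][b] = best importance using the first i items within cost b.
    best = [[0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: prune item i-1
            if cost[i - 1] <= b:
                keep_value = best[i - 1][b - cost[i - 1]] + importance[i - 1]
                if keep_value > best[i][b]:
                    best[i][b] = keep_value  # option 2: keep item i-1
    # Walk the table backwards to recover which items were kept.
    kept = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            kept.append(i - 1)
            b -= cost[i - 1]
    kept.reverse()
    return best[n][budget], kept


# Four hypothetical channels with integer importance scores and costs.
importance = [9, 5, 4, 3]
cost = [5, 3, 2, 2]
total, kept = knapsack_select(importance, cost, budget=7)
```

With these made-up numbers the selection keeps channels 0 and 2, which exactly fill the budget of 7 for a total importance of 13.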
 [41] arXiv:2002.08264 [pdf, other]

Title: Molecule Attention Transformer
Authors: Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, Stanisław Jastrzębski
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
Designing a single neural network architecture that performs competitively across a range of molecule property prediction tasks remains largely an open challenge, and its solution may unlock a widespread use of deep learning in the drug discovery industry. To move towards this goal, we propose Molecule Attention Transformer (MAT). Our key innovation is to augment the attention mechanism in Transformer using inter-atomic distances and the molecular graph structure. Experiments show that MAT performs competitively on a diverse set of molecular prediction tasks. Most importantly, with a simple self-supervised pretraining, MAT requires tuning of only a few hyperparameter values to achieve state-of-the-art performance on downstream tasks. Finally, we show that attention weights learned by MAT are interpretable from the chemical point of view.
 [42] arXiv:2002.08274 [pdf, other]

Title: Outcome Correlation in Graph Neural Network Regression
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Graph neural networks aggregate features in vertex neighborhoods to learn vector representations of all vertices, using supervision from some labeled vertices during training. The predictor is then a function of the vector representation, and predictions are made independently on unlabeled nodes. This widely adopted approach implicitly assumes that vertex labels are independent after conditioning on their neighborhoods. We show that this strong assumption is far from true on many real-world graph datasets and severely limits predictive power on a number of regression tasks. Given that traditional graph-based semi-supervised learning methods operate in the opposite manner by explicitly modeling the correlation in predicted outcomes, this limitation may not be all that surprising.
Here, we address this issue with a simple and interpretable framework that can improve any graph neural network architecture by modeling correlation structure in regression outcome residuals. Specifically, we model the joint distribution of outcome residuals on vertices with a parameterized multivariate Gaussian, where the parameters are estimated by maximizing the marginal likelihood of the observed labels. Our model substantially boosts the performance of graph neural networks, and the learned parameters can also be interpreted as the strength of correlation among connected vertices. To allow us to scale to large networks, we design linear-time algorithms for low-variance, unbiased model parameter estimates based on stochastic trace estimation. We also provide a simplified version of our method that makes stronger assumptions on correlation structure but is extremely easy to implement and provides great practical performance in several cases.
 [43] arXiv:2002.08289 [pdf, other]

Title: Variational Encoder-based Reliable Classification
Comments: 7 pages, 6 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Machine learning models provide statistically impressive results which might be individually unreliable. To provide reliability, we propose an Epistemic Classifier (EC) that can provide justification of its belief using support from the training dataset as well as quality of reconstruction. Our approach is based on modified variational auto-encoders that can identify a semantically meaningful low-dimensional space where perceptually similar instances are close in $\ell_2$-distance too. Our results demonstrate improved reliability of predictions and robust identification of samples with adversarial attacks compared to a baseline of softmax-based thresholding.
 [44] arXiv:2002.08329 [pdf, other]

Title: Value-driven Hindsight Modelling
Authors: Arthur Guez, Fabio Viola, Théophane Weber, Lars Buesing, Steven Kapturowski, Doina Precup, David Silver, Nicolas Heess
Comments: 8 pages + reference + appendix
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Value estimation is a critical component of the reinforcement learning (RL) paradigm. The question of how to effectively learn predictors for value from data is one of the major problems studied by the RL community, and different approaches exploit structure in the problem domain in different ways. Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function. In contrast, model-free methods directly leverage the quantity of interest from the future but have to compose with a potentially weak scalar signal (an estimate of the return). In this paper we develop an approach for representation learning in RL that sits in between these two extremes: we propose to learn what to model in a way that can directly help value prediction. To this end we determine which features of the future trajectory provide useful information to predict the associated return. This provides us with tractable prediction targets that are directly relevant for a task, and can thus accelerate learning of the value function. The idea can be understood as reasoning, in hindsight, about which aspects of the future observations could help past value prediction. We show how this can help dramatically even in simple policy evaluation settings. We then test our approach at scale in challenging domains, including on 57 Atari 2600 games.
 [45] arXiv:2002.08338 [pdf, ps, other]

Title: Multiple Imputation with Denoising Autoencoder using Metamorphic Truth and Imputation Feedback
Authors: Hawminn Lu (1), Giancarlo Perrone (1), José Unpingco (1) ((1) Gary and Mary West Health Institute)
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Although data may be abundant, complete data is less so, due to missing columns or rows. This missingness undermines the performance of downstream data products that either omit incomplete cases or create derived completed data for subsequent processing. Appropriately managing missing data is required in order to fully exploit and correctly use data. We propose a Multiple Imputation model using Denoising Autoencoders to learn the internal representation of data. Furthermore, we use the novel mechanisms of Metamorphic Truth and Imputation Feedback to maintain statistical integrity of attributes and eliminate bias in the learning process. Our approach explores the effects of imputation on various missingness mechanisms and patterns of missing data, outperforming other methods in many standard test cases.
 [46] arXiv:2002.08339 [pdf, other]

Title: NeuroFabric: Identifying Ideal Topologies for Training A Priori Sparse Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Long training times of deep neural networks are a bottleneck in machine learning research. The major impediment to fast training is the quadratic growth of both memory and compute requirements of dense and convolutional layers with respect to their information bandwidth. Recently, training 'a priori' sparse networks has been proposed as a method for allowing layers to retain high information bandwidth, while keeping memory and compute low. However, the choice of which sparse topology should be used in these networks is unclear. In this work, we provide a theoretical foundation for the choice of intra-layer topology. First, we derive a new sparse neural network initialization scheme that allows us to explore the space of very deep sparse networks. Next, we evaluate several topologies and show that seemingly similar topologies can often have a large difference in attainable accuracy. To explain these differences, we develop a data-free heuristic that can evaluate a topology independently from the dataset the network will be trained on. We then derive a set of requirements that make a good topology, and arrive at a single topology that satisfies all of them.
 [47] arXiv:2002.08345 [pdf, other]

Title: Schoenberg-Rao distances: Entropy-based and geometry-aware statistical Hilbert distances
Comments: 18 pages, 8 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Distances between probability distributions that take into account the geometry of their sample space, like the Wasserstein or the Maximum Mean Discrepancy (MMD) distances, have received a lot of attention in machine learning as they can, for instance, be used to compare probability distributions with disjoint supports. In this paper, we study a class of statistical Hilbert distances that we term the Schoenberg-Rao distances, a generalization of the MMD that allows one to consider a broader class of kernels, namely the conditionally negative semi-definite kernels. In particular, we introduce a principled way to construct such kernels and derive novel closed-form distances between mixtures of Gaussian distributions, among others. These distances, derived from the concave Rao's quadratic entropy, enjoy nice theoretical properties and possess interpretable hyperparameters which can be tuned for specific applications. Our method constitutes a practical alternative to Wasserstein distances and we illustrate its efficiency on a broad range of machine learning tasks such as density estimation, generative modeling and mixture simplification.
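For readers unfamiliar with the MMD that this abstract takes as its starting point, the following is a small sketch of the standard biased (V-statistic) estimate of the squared MMD between two one-dimensional samples. The Gaussian kernel, the bandwidth, the sample sizes, and all names are assumptions for illustration; the sketch does not implement the Schoenberg-Rao distances themselves.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2: E k(x, x') + E k(y, y') - 2 E k(x, y)."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


rng = random.Random(0)  # fixed seed so the example is reproducible
sample_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
sample_b = [rng.gauss(3.0, 1.0) for _ in range(200)]

mmd_same = mmd_squared(sample_a, sample_a)     # identical samples
mmd_shifted = mmd_squared(sample_a, sample_b)  # mean shifted by 3
```

The estimate is zero when a sample is compared with itself and clearly positive when the two samples come from well-separated distributions, which is the behaviour one expects from a distance of this type.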
 [48] arXiv:2002.08347 [pdf, other]

Title: On Adaptive Attacks to Adversarial Example Defenses
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS (and chosen for illustrative and pedagogical purposes) can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result (showing that a defense was ineffective), this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
Cross-lists for Thu, 20 Feb 20
 [49] arXiv:2002.07870 (cross-list from cs.RO) [pdf, other]

Title: Online Parameter Estimation for Safety-Critical Systems with Gaussian Processes
Comments: 7 pages, 5 figures, 1 table
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Parameter estimation is crucial for modeling, tracking, and control of complex dynamical systems. However, parameter uncertainties can compromise system performance under a controller relying on nominal parameter values. Typically, parameters are estimated using numerical regression approaches framed as inverse problems. However, they suffer from non-uniqueness due to the existence of multiple local optima, reliance on gradients, the need for large amounts of experimental data, or stability issues. Addressing these drawbacks, we present a Bayesian optimization framework based on Gaussian processes (GPs) for online parameter estimation. It uses an efficient search strategy over a response surface in the parameter space for finding the global optimum with minimal function evaluations. The response surface is modeled as correlated surrogates using GPs on noisy data. The GP posterior predictive variance is exploited for smart adaptive sampling. This balances the exploration versus exploitation trade-off, which is key in reaching the global optimum under a limited budget. We demonstrate our technique on an actuated planar pendulum and a safety-critical quadrotor in simulation with changing parameters. We also benchmark our results against solvers using interior-point and sequential quadratic programming methods. By reconfiguring the controller with new optimized parameters iteratively, we drastically improve trajectory tracking of the system versus the nominal case and other solvers.
 [50] arXiv:2002.07873 (cross-list from q-bio.QM) [pdf, other]

Title: A survey of statistical learning techniques as applied to inexpensive pediatric Obstructive Sleep Apnea data
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Pediatric obstructive sleep apnea affects an estimated 1-5% of elementary-school aged children and can lead to other detrimental health problems. Swift diagnosis and treatment are critical to a child's growth and development, but the variability of symptoms and the complexity of the available data make this a challenge. We take a first step in streamlining the process by focusing on inexpensive data from questionnaires and craniofacial measurements. We apply correlation networks, the Mapper algorithm from topological data analysis, and singular value decomposition in a process of exploratory data analysis. We then apply a variety of supervised and unsupervised learning techniques from statistics, machine learning, and topology, ranging from support vector machines to Bayesian classifiers and manifold learning. Finally, we analyze the results of each of these methods and discuss the implications for a multi-data-sourced algorithm moving forward.
 [51] arXiv:2002.07874 (cross-list from q-bio.QM) [pdf, other]

Title: Ensemble Deep Learning on Large, Mixed-Site fMRI Datasets in Autism and Other Tasks
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
Deep learning models for MRI classification face two recurring problems: they are typically limited by low sample size, and are abstracted by their own complexity (the "black box problem"). In this paper, we train a convolutional neural network (CNN) with the largest multi-source, functional MRI (fMRI) connectomic dataset ever compiled, consisting of 43,858 datapoints. We apply this model to a cross-sectional comparison of autism (ASD) vs typically developing (TD) controls that has proved difficult to characterise with inferential statistics. To contextualise these findings, we additionally perform classifications of gender and task vs rest. Employing class-balancing to build a training set, we trained 3$\times$300 modified CNNs in an ensemble model to classify fMRI connectivity matrices with overall AUROCs of 0.6774, 0.7680, and 0.9222 for ASD vs TD, gender, and task vs rest, respectively. Additionally, we aim to address the black box problem in this context using two visualization methods. First, class activation maps show which functional connections of the brain our models focus on when performing classification. Second, by analyzing maximal activations of the hidden layers, we were also able to explore how the model organizes a large and mixed-centre dataset, finding that it dedicates specific areas of its hidden layers to processing different covariates of data (depending on the independent variable analyzed), and other areas to mix data from different sources. Our study finds that deep learning models that distinguish ASD from TD controls focus broadly on temporal and cerebellar connections, with a particularly high focus on the right caudate nucleus and paracentral sulcus.
 [52] arXiv:2002.07877 (cross-list from cs.IR) [pdf, other]

Title: CBIR using features derived by Deep Learning
Comments: 18 pages, 31 figures
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
In a Content Based Image Retrieval (CBIR) System, the task is to retrieve similar images from a large database given a query image. The usual procedure is to extract some useful features from the query image, and retrieve images which have a similar set of features. For this purpose, a suitable similarity measure is chosen, and images with high similarity scores are retrieved. Naturally the choice of these features plays a very important role in the success of this system, and high-level features are required to reduce the semantic gap.
In this paper, we propose to use features derived from pre-trained network models from a deep-learning convolution network trained for a large image classification problem. This approach appears to produce vastly superior results for a variety of databases, and it outperforms many contemporary CBIR systems. We analyse the retrieval time of the method, and also propose a pre-clustering of the database based on the above-mentioned features which yields comparable results in a much shorter time in most of the cases.
 [53] arXiv:2002.07884 (cross-list from stat.ML) [pdf, ps, other]

Title: Observational nonidentifiability, generalized likelihood and free energy
Authors: A.E. Allahverdyan
Comments: 25 pages, 1 figure
Subjects: Machine Learning (stat.ML); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
We study the parameter estimation problem in mixture models with observational nonidentifiability: the full model (also containing hidden variables) is identifiable, but the marginal (observed) model is not. Hence global maxima of the marginal likelihood are (infinitely) degenerate and predictions of the marginal likelihood are not unique. We show how to generalize the marginal likelihood by introducing an effective temperature, and making it similar to the free energy. This generalization resolves the observational nonidentifiability, since its maximization leads to unique results that are better than a random selection of one degenerate maximum of the marginal likelihood or the averaging over many such maxima. The generalized likelihood inherits many features from the usual likelihood, e.g. it satisfies the conditionality principle, and its local maximum can be searched for via a suitably modified expectation-maximization method. The maximization of the generalized likelihood relates to entropy optimization.
 [54] arXiv:2002.07897 (cross-list from eess.IV) [pdf, other]

Title: LocoGAN - Locally Convolutional GAN
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
In the paper we construct a fully convolutional GAN model, LocoGAN, whose latent space is given by noise-like images of possibly different resolutions. The learning is local, i.e. we process not the whole noise-like image, but sub-images of a fixed size. As a consequence, LocoGAN can produce images of arbitrary dimensions, e.g. for the LSUN bedroom data set. Another advantage of our approach comes from the fact that we use position channels, which allows the generation of fully periodic (e.g. cylindrical panoramic images) or almost periodic "infinitely long" images (e.g. wallpapers).
 [55] arXiv:2002.07940 (cross-list from astro-ph.CO) [pdf, other]

Title: A unified framework for 21cm tomography sample generation and parameter inference with Progressively Growing GANs
Comments: 15 pages, 8+1 figures, accepted by MNRAS
Subjects: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Creating a database of 21cm brightness temperature signals from the Epoch of Reionisation (EoR) for an array of reionisation histories is a complex and computationally expensive task, given the range of astrophysical processes involved and the possibly high-dimensional parameter space that is to be probed. We utilise a specific type of neural network, a Progressively Growing Generative Adversarial Network (PGGAN), to produce realistic tomography images of the 21cm brightness temperature during the EoR, covering a continuous three-dimensional parameter space that models varying X-ray emissivity, Lyman-band emissivity, and ratio between hard and soft X-rays. The GPU-trained network generates new samples at a resolution of $\sim 3'$ in a second (on a laptop CPU), and the resulting global 21cm signal, power spectrum, and pixel distribution function agree well with those of the training data, taken from the 21SSD catalogue \citep{Semelin2017}. Finally, we showcase how a trained PGGAN can be leveraged for the converse task of inferring parameters from 21cm tomography samples via Approximate Bayesian Computation.
 [56] arXiv:2002.07964 (cross-list from stat.AP) [pdf]

Title: Tourism Demand Forecasting with Tourist Attention: An Ensemble Deep Learning Approach
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Econometrics (econ.EM)
The large amount of tourism-related data presents a series of challenges for tourism demand forecasting, including data deficiencies, multicollinearity and long calculation time. In this study, a Bagging-based multivariate ensemble deep learning model integrating Stacked Autoencoders and KELM (BSAKE) is proposed to address these challenges. We forecast tourist arrivals in Beijing from four countries using historical data on tourist arrivals in Beijing, economic indicators and tourist online behavior variables. The results from the cases of four origin countries suggest that our proposed BSAKE model outperforms benchmark models in horizontal accuracy, directional accuracy and statistical significance. Both Bagging and Stacked Autoencoder can improve the forecasting performance of the models. Moreover, the forecasting performance of the models is evaluated with consistent results by means of the multi-step-ahead forecasting scheme.
 [57] arXiv:2002.08014 (cross-list from stat.ML) [pdf, other]

Title: Communication-Efficient Distributed SVD via Local Power Iterations
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
We study the distributed computing of the truncated singular value decomposition (SVD). We develop an algorithm that we call \texttt{LocalPower} for improving the communication efficiency. Specifically, we uniformly partition the dataset among $m$ nodes and alternate between multiple (precisely $p$) local power iterations and one global aggregation. We theoretically show that under certain assumptions, \texttt{LocalPower} lowers the required number of communications by a factor of $p$ to reach a certain accuracy. We also show that the strategy of periodically decaying $p$ helps improve the performance of \texttt{LocalPower}. We conduct experiments to demonstrate the effectiveness of \texttt{LocalPower}.
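The alternation described above (several local power iterations, then one global aggregation) can be illustrated with a deliberately simplified sketch. It is an assumption-laden toy, not the paper's \texttt{LocalPower}: it tracks only the top right singular vector (rank 1), aggregates by plain averaging, and the names `gram_apply`, `local_power_top1` and `rayleigh` are hypothetical helpers introduced here for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gram_apply(rows, v):
    """Return A^T (A v) for a local data block A given as a list of rows."""
    out = [0.0] * len(v)
    for r in rows:
        dot = sum(ri * vi for ri, vi in zip(r, v))
        for j, rj in enumerate(r):
            out[j] += dot * rj
    return out


def local_power_top1(blocks, p, rounds, v0):
    """Rank-1 toy: p local power iterations per node, then one averaging step.

    Assumption: plain averaging plus renormalisation stands in for the
    global aggregation; a rank-k method would need extra alignment steps.
    """
    v = normalize(v0)
    for _ in range(rounds):
        local_vectors = []
        for rows in blocks:              # each block lives on one node
            u = v
            for _ in range(p):           # p local power iterations
                u = normalize(gram_apply(rows, u))
            local_vectors.append(u)
        # one communication round: average the local estimates
        avg = [sum(u[j] for u in local_vectors) / len(local_vectors)
               for j in range(len(v))]
        v = normalize(avg)
    return v


def rayleigh(blocks, v):
    """v^T (A^T A) v for the full data set, summed over all blocks."""
    return sum(
        sum(a * b for a, b in zip(gram_apply(rows, v), v)) for rows in blocks
    )


# Two nodes holding slightly different 2-D data; the pooled Gram matrix is
# diag(8, 2.02), so the top right singular vector is (1, 0).
blocks = [[[2.0, 0.1], [0.0, 1.0]], [[2.0, -0.1], [0.0, 1.0]]]
v = local_power_top1(blocks, p=4, rounds=10, v0=[1.0, 1.0])
print(v, rayleigh(blocks, v))
```

In this toy the two nodes are mirror images of each other, so the averaged vector settles on the pooled top singular vector; with heterogeneous nodes a fixed $p$ generally leaves some bias, which is consistent with the abstract's remark that decaying $p$ helps.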
 [58] arXiv:2002.08021 (cross-list from stat.AP) [pdf]

Title: Seasonal and Trend Forecasting of Tourist Arrivals: An Adaptive Multiscale Ensemble Learning Approach
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Econometrics (econ.EM)
The accurate seasonal and trend forecasting of tourist arrivals is a very challenging task. Despite the importance of seasonal and trend forecasting of tourist arrivals, limited research has paid attention to it previously. In this study, a new adaptive multiscale ensemble (AME) learning approach incorporating variational mode decomposition (VMD) and least square support vector regression (LSSVR) is developed for short-, medium-, and long-term seasonal and trend forecasting of tourist arrivals. In the formulation of our developed AME learning approach, the original tourist arrivals series are first decomposed into trend, seasonal and remainder volatility components. Then, ARIMA is used to forecast the trend component, SARIMA is used to forecast the seasonal component with a 12-month cycle, and LSSVR is used to forecast the remainder volatility components. Finally, the forecasting results of the three components are aggregated to generate an ensemble forecast of tourist arrivals by the LSSVR-based nonlinear ensemble approach. Furthermore, a direct strategy is used to implement multi-step-ahead forecasting. Using two accuracy measures and the Diebold-Mariano test, the empirical results demonstrate that our proposed AME learning approach can achieve higher level and directional forecasting accuracy compared with other benchmarks used in this study, indicating that our proposed approach is a promising model for forecasting tourist arrivals with high seasonality and volatility.
 [59] arXiv:2002.08024 (cross-list from cs.CL) [pdf, other]

Title: Non-Autoregressive Dialog State Tracking
Comments: Accepted at ICLR 2020
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Recent efforts in Dialogue State Tracking (DST) for task-oriented dialogues have progressed toward open-vocabulary or generation-based approaches where the models can generate slot value candidates from the dialogue history itself. These approaches have shown good performance gain, especially in complicated dialogue domains with dynamic slot values. However, they fall short in two aspects: (1) they do not allow models to explicitly learn signals across domains and slots to detect potential dependencies among (domain, slot) pairs; and (2) existing models follow auto-regressive approaches which incur high time cost when the dialogue evolves over multiple domains and multiple turns. In this paper, we propose a novel framework of Non-Autoregressive Dialog State Tracking (NADST) which can factor in potential dependencies among domains and slots to optimize the models towards better prediction of dialogue states as a complete set rather than separate slots. In particular, the non-autoregressive nature of our method not only enables decoding in parallel to significantly reduce the latency of DST for real-time dialogue response generation, but also detects dependencies among slots at token level in addition to slot and domain level. Our empirical results show that our model achieves the state-of-the-art joint accuracy across all domains on the MultiWOZ 2.1 corpus, and the latency of our model is an order of magnitude lower than the previous state of the art as the dialogue history extends over time.
 [60] arXiv:2002.08025 (cross-list from cs.CR) [pdf, other]

Title: Influence Function based Data Poisoning Attacks to Top-N Recommender Systems
Comments: Accepted by WWW 2020; This is technical report version
Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
A recommender system is an essential component of web services to engage users. Popular recommender systems model user preferences and item properties using a large amount of crowdsourced user-item interaction data, e.g., rating scores; then the top-$N$ items that best match a user's preference are recommended to the user. In this work, we show that an attacker can launch a data poisoning attack on a recommender system to make recommendations as the attacker desires by injecting fake users with carefully crafted user-item interaction data. Specifically, an attacker can trick a recommender system into recommending a target item to as many normal users as possible. We focus on matrix-factorization-based recommender systems because they have been widely deployed in industry. Given the number of fake users the attacker can inject, we formulate the crafting of rating scores for the fake users as an optimization problem. However, this optimization problem is challenging to solve as it is a non-convex integer programming problem. To address the challenge, we develop several techniques to approximately solve the optimization problem. For instance, we leverage the influence function to select a subset of normal users who are influential to the recommendations and solve our formulated optimization problem based on these influential users. Our results show that our attacks are effective and outperform existing methods.
 [61] arXiv:2002.08027 (cross-list from cs.CR) [pdf, other]

Title: Toward Low-Cost and Stable Blockchain Networks
Comments: Accepted by IEEE ICC 2020
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Envisioned to be the future of distributed systems, blockchain networks have received increasing attention from both industry and academic research in recent years. However, the blockchain mining process consumes vast amounts of energy, and studies have shown that the amount of energy consumed in Bitcoin mining is almost the same as the electricity used in Ireland. To address the high mining energy cost problem of blockchain networks, in this paper, we propose a blockchain mining resources allocation algorithm to reduce the mining cost in PoW-based (proof-of-work-based) blockchain networks. We first provide a systematic study of a general blockchain queueing model. In our queueing model, transactions arrive randomly at the queue and are served in a batch manner with unknown probability distribution and agnostic to any priority mechanism. Then, we leverage Lyapunov optimization techniques to propose a dynamic mining resources allocation algorithm (DMRA), which is parameterized by a tuning parameter $K>0$. We show that our algorithm achieves a performance-delay tradeoff of $[O(1/K), O(K)]$. The simulation results also demonstrate the effectiveness of DMRA in reducing the mining cost.
 [62] arXiv:2002.08114 (cross-list from cs.DM) [pdf, other]

Title: BB_Evac: Fast Location-Sensitive Behavior-Based Building Evacuation
Subjects: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Past work on evacuation planning assumes that evacuees will follow instructions; however, there is ample evidence that this is not the case. While some people will follow instructions, others will follow their own desires. In this paper, we present a formal definition of a behavior-based evacuation problem (BBEP) in which a human behavior model is taken into account when planning an evacuation. We show that a specific form of constraints can be used to express such behaviors. We show that BBEPs can be solved exactly via an integer program called BB_IP, and inexactly by a much faster algorithm that we call BB_Evac. We conducted a detailed experimental evaluation of both algorithms applied to buildings (though in principle the algorithms can be applied to any graphs) and show that the latter is an order of magnitude faster than BB_IP while producing results that are almost as good on one real-world building graph as well as on several synthetically generated graphs.
 [63] arXiv:2002.08126 (cross-list from cs.CL) [pdf, ps, other]

Title: Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recently, language identity information has been utilized to improve the performance of end-to-end code-switching (CS) speech recognition. However, previous works use an additional language identification (LID) model as an auxiliary module, which makes the system complex. In this work, we propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the problem. We use the language identities to bias the model to predict the CS points. This encourages the model to learn the language identity information directly from transcription, and no additional LID model is needed. We evaluate the approach on a Mandarin-English CS corpus SEAME. Compared to our RNN-T baseline, the proposed method can achieve 16.2% and 12.9% relative error reduction on two test sets, respectively.
 [64] arXiv:2002.08129 (cross-list from stat.ML) [pdf, other]

Title: Bayesian Experimental Design for Implicit Models by Mutual Information Neural Estimation
Comments: Conference submission
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Implicit stochastic models, where the data-generation distribution is intractable but sampling is possible, are ubiquitous in the natural sciences. The models typically have free parameters that need to be inferred from data collected in scientific experiments. A fundamental question is how to design the experiments so that the collected data are most useful. The field of Bayesian experimental design advocates that, ideally, we should choose designs that maximise the mutual information (MI) between the data and the parameters. For implicit models, however, this approach is severely hampered by the high computational cost of computing posteriors and maximising MI, in particular when we have more than a handful of design variables to optimise. In this paper, we propose a new approach to Bayesian experimental design for implicit models that leverages recent advances in neural MI estimation to deal with these issues. We show that training a neural network to maximise a lower bound on MI allows us to jointly determine the optimal design and the posterior. Simulation studies illustrate that this gracefully extends Bayesian experimental design for implicit models to higher design dimensions.
 [65] arXiv:2002.08158 (cross-list from eess.IV) [pdf, other]

Title: Variable-Bitrate Neural Compression via Bayesian Arithmetic Coding
Comments: 8 pages + detailed supplement with additional full resolution reconstructed images
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Deep Bayesian latent variable models have enabled new approaches to both model and data compression. Here, we propose a new algorithm for compressing latent representations in deep probabilistic models, such as variational autoencoders, in post-processing. The approach thus separates model design and training from the compression task. Our algorithm generalizes arithmetic coding to the continuous domain, using adaptive discretization accuracy that exploits estimates of posterior uncertainty. A consequence of the "plug and play" nature of our approach is that various rate-distortion trade-offs can be achieved with a single trained model, eliminating the need to train multiple models for different bit rates. Our experimental results demonstrate the importance of taking into account posterior uncertainties, and show that image compression with the proposed algorithm outperforms JPEG over a wide range of bit rates using only a single machine learning model. Further experiments on Bayesian neural word embeddings demonstrate the versatility of the proposed method.
 [66] arXiv:2002.08159 (cross-list from stat.ML) [pdf, other]

Title: Learning Fair Scoring Functions: Fairness Definitions, Algorithms and Generalization Bounds for Bipartite Ranking
Comments: 27 pages, 11 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Many applications of artificial intelligence, ranging from credit lending to the design of medical diagnosis support tools through recidivism prediction, involve scoring individuals using a learned function of their attributes. These predictive risk scores are used to rank a set of people, and/or take individual decisions about them based on whether the score exceeds a certain threshold that may depend on the context in which the decision is taken. The level of delegation granted to such systems will heavily depend on how questions of fairness can be answered. While this concern has received a lot of attention in the classification setup, the design of relevant fairness constraints for the problem of learning scoring functions has not been much investigated. In this paper, we propose a flexible approach to group fairness for the scoring problem with binary labeled data, a standard learning task referred to as bipartite ranking. We argue that the functional nature of the ROC curve, the gold standard measuring ranking performance in this context, leads to several possible ways of formulating fairness constraints. We introduce general classes of fairness conditions in bipartite ranking and establish generalization bounds for scoring rules learned under such constraints. Beyond the theoretical formulation and results, we design practical learning algorithms and illustrate our approach with numerical experiments.
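As a rough illustration of how ranking performance can be compared across groups, the sketch below computes the AUC (the area under the ROC curve) of a scoring function separately within each group and reports the gap. This is only one simple, assumed way to quantify group fairness for a scoring rule; it is not claimed to be the paper's definition, and the function names and toy data are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative.

    Ties count as one half, which matches the area under the ROC curve.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def groupwise_auc(scores, labels, groups):
    """AUC computed separately inside each group (assumed fairness proxy)."""
    result = {}
    for g in sorted(set(groups)):
        pos = [s for s, y, a in zip(scores, labels, groups) if a == g and y == 1]
        neg = [s for s, y, a in zip(scores, labels, groups) if a == g and y == 0]
        result[g] = auc(pos, neg)
    return result


# Toy data: the scorer ranks group "a" perfectly and group "b" poorly.
scores = [0.9, 0.8, 0.3, 0.2, 0.6, 0.4, 0.7, 0.1]
labels = [1, 1, 0, 0, 1, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = groupwise_auc(scores, labels, groups)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```

A single AUC number summarises the whole ROC curve, so two groups can share an AUC while their curves differ; the abstract's point about the functional nature of the ROC curve is exactly that several distinct constraints are possible.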
 [67] arXiv:2002.08235 (cross-list from cs.CE) [pdf, other]

Title: Physics-informed Neural Networks for Solving Nonlinear Diffusivity and Biot's equations
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
This paper presents the potential of applying physics-informed neural networks for solving nonlinear multiphysics problems, which are essential to many fields such as biomedical engineering, earthquake prediction, and underground energy harvesting. Specifically, we investigate how to extend the methodology of physics-informed neural networks to solve both the forward and inverse problems in relation to the nonlinear diffusivity and Biot's equations. We explore the accuracy of the physics-informed neural networks with different training example sizes and choices of hyperparameters. The impacts of the stochastic variations between various training realizations are also investigated. In the inverse case, we also study the effects of noisy measurements. Furthermore, we address the challenge of selecting the hyperparameters of the inverse model and illustrate how this challenge is linked to the hyperparameters selection performed for the forward one.
 [68] arXiv:2002.08240 (cross-list from quant-ph) [pdf, ps, other]

Title: Quantum statistical query learning
Comments: 24 Pages
Subjects: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Machine Learning (cs.LG)
We propose a learning model called the quantum statistical query (QSQ) learning model, which extends the SQ learning model introduced by Kearns to the quantum setting. Our model can also be seen as a restriction of the quantum PAC learning model: here, the learner does not have direct access to quantum examples, but can only obtain estimates of measurement statistics on them. Theoretically, this model provides a simple yet expressive setting to explore the power of quantum examples in machine learning. From a practical perspective, since simpler operations are required, learning algorithms in the QSQ model are more feasible for implementation on near-term quantum devices. We prove a number of results about the QSQ learning model. We first show that parity functions, (log n)-juntas and polynomial-sized DNF formulas are efficiently learnable in the QSQ model, in contrast to the classical setting where these problems are provably hard. This implies that many of the advantages of quantum PAC learning can be realized even in the more restricted quantum SQ learning model. It is well-known that weak statistical query dimension, denoted by WSQDIM(C), characterizes the complexity of learning a concept class C in the classical SQ model. We show that log(WSQDIM(C)) is a lower bound on the complexity of QSQ learning, and furthermore it is tight for certain concept classes C. Additionally, we show that this quantity provides strong lower bounds for the small-bias quantum communication model under product distributions. Finally, we introduce the notion of private quantum PAC learning, in which a quantum PAC learner is required to be differentially private. We show that learnability in the QSQ model implies learnability in the quantum private PAC model. Additionally, we show that in the private PAC learning setting, the classical and quantum sample complexities are equal, up to constant factors.
 [69] arXiv:2002.08246 (cross-list from math.OC) [pdf, other]

Title: A Unified Convergence Analysis for Shuffling-Type Gradient Methods
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper, we provide a unified convergence analysis for a class of shuffling-type gradient methods for solving a well-known finite-sum minimization problem commonly used in machine learning. This algorithm covers various variants such as randomized reshuffling, single shuffling, and cyclic/incremental gradient schemes. We consider two different settings: strongly convex and nonconvex problems. Our main contribution consists of new non-asymptotic and asymptotic convergence rates for a general class of shuffling-type gradient methods to solve both nonconvex and strongly convex problems. While our rate in the nonconvex problem is new (i.e. not known yet under standard assumptions), the rate on the strongly convex case matches (up to a constant) the best-known results. However, unlike existing works in this direction, we only use standard assumptions such as smoothness and strong convexity. Finally, we empirically illustrate the effect of learning rates via a nonconvex logistic regression and neural network examples.
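The three variants named above differ only in how the component order is chosen in each epoch. The following minimal sketch makes that concrete on a toy one-dimensional least-squares problem; the objective, the constant learning rate, and the name `shuffling_gd` are assumptions chosen for illustration and do not reproduce the paper's algorithm statement or step-size schedule.

```python
import random


def shuffling_gd(targets, scheme, epochs=300, lr=0.01, seed=0):
    """Minimise (1/n) * sum_i 0.5 * (w - targets[i])**2, one pass per epoch.

    scheme is one of:
      "random_reshuffle" - draw a fresh permutation every epoch
      "single_shuffle"   - draw one permutation and reuse it
      "incremental"      - always use the natural order 0..n-1
    """
    if scheme not in ("random_reshuffle", "single_shuffle", "incremental"):
        raise ValueError(f"unknown scheme: {scheme}")
    rng = random.Random(seed)
    order = list(range(len(targets)))
    if scheme == "single_shuffle":
        rng.shuffle(order)
    w = 0.0
    for _ in range(epochs):
        if scheme == "random_reshuffle":
            rng.shuffle(order)
        for i in order:
            grad = w - targets[i]   # gradient of the i-th component
            w -= lr * grad
    return w


targets = [1.0, 2.0, 3.0, 4.0, 10.0]   # the minimiser is the mean, 4.0
results = {
    scheme: shuffling_gd(targets, scheme)
    for scheme in ("random_reshuffle", "single_shuffle", "incremental")
}
print(results)
```

With a constant step size each variant settles close to, but not exactly at, the minimiser; the residual depends on the ordering and shrinks with the learning rate, which is the kind of learning-rate effect the abstract says is examined empirically.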
 [70] arXiv:2002.08249 (cross-list from eess.AS) [pdf, other]

Title: Workshop Report: Detection and Classification in Marine Bioacoustics with Deep Learning
Comments: 13 pages, 1 figure, 1 table
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
On 21-22 November 2019, about 30 researchers gathered in Victoria, BC, Canada, for the workshop "Detection and Classification in Marine Bioacoustics with Deep Learning" organized by MERIDIAN and hosted by Ocean Networks Canada. The workshop was attended by marine biologists, data scientists, and computer scientists coming from both Canadian coasts and the US and representing a wide spectrum of research organizations including universities, government (Fisheries and Oceans Canada, National Oceanic and Atmospheric Administration), industry (JASCO Applied Sciences, Google, Axiom Data Science), and non-for-profits (Orcasound, OrcaLab). Consisting of a mix of oral presentations, open discussion sessions, and hands-on tutorials, the workshop program offered a rare opportunity for specialists from distinctly different domains to engage in conversation about deep learning and its promising potential for the development of detection and classification algorithms in underwater acoustics. In this workshop report, we summarize key points from the presentations and discussion sessions.
 [71] arXiv:2002.08253 (cross-list from stat.ML) [pdf, ps, other]

Title: Distance-Based Regularisation of Deep Networks for Fine-Tuning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We investigate approaches to regularisation during fine-tuning of deep neural networks. First we provide a neural network generalisation bound based on Rademacher complexity that uses the distance the weights have moved from their initial values. This bound has no direct dependence on the number of weights and compares favourably to other bounds when applied to convolutional networks. Our bound is highly relevant for fine-tuning, because providing a network with a good initialisation based on transfer learning means that learning can modify the weights less, and hence achieve tighter generalisation. Inspired by this, we develop a simple yet effective fine-tuning algorithm that constrains the hypothesis class to a small sphere centred on the initial pre-trained weights, thus obtaining provably better generalisation performance than conventional transfer learning. Empirical evaluation shows that our algorithm works well, corroborating our theoretical results. It outperforms both state-of-the-art fine-tuning competitors, and penalty-based alternatives that we show do not directly constrain the radius of the search space.
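One common way to keep weights inside a ball around their initial values is projected gradient descent: take an ordinary gradient step, then project back onto the ball if the step left it. The sketch below shows that idea under explicit assumptions: a Euclidean norm, a single flat weight vector, and hypothetical function names. The paper's actual norm, per-layer treatment and update rule may differ.

```python
import math


def project_to_ball(weights, init_weights, radius):
    """Project weights onto the Euclidean ball of given radius around init_weights."""
    diff = [w - w0 for w, w0 in zip(weights, init_weights)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= radius:
        return list(weights)            # already inside the ball
    scale = radius / dist               # shrink the displacement to the boundary
    return [w0 + scale * d for w0, d in zip(init_weights, diff)]


def constrained_step(weights, grads, init_weights, radius, lr=0.1):
    """One gradient step followed by projection (assumed update rule)."""
    stepped = [w - lr * g for w, g in zip(weights, grads)]
    return project_to_ball(stepped, init_weights, radius)


# Toy example: the raw step would move the weights a distance of 5 away
# from the pre-trained point, so the projection pulls them back to 0.5.
w0 = [1.0, 1.0]
w = constrained_step(w0, grads=[-30.0, -40.0], init_weights=w0, radius=0.5)
print(w)
```

Unlike a penalty on the distance from the initial weights, the projection enforces the radius as a hard constraint, which is the distinction the abstract draws with penalty-based alternatives.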
 [72] arXiv:2002.08260 (cross-list from stat.ML) [pdf, other]

Title: Learning Bounds for Moment-Based Domain Adaptation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Domain adaptation algorithms are designed to minimize the misclassification risk of a discriminative model for a target domain with little training data by adapting a model from a source domain with a large amount of training data. Standard approaches measure the adaptation discrepancy based on distance measures between the empirical probability distributions in the source and target domain. In this setting, we address the problem of deriving learning bounds under practice-oriented general conditions on the underlying probability distributions. As a result, we obtain learning bounds for domain adaptation based on finitely many moments and smoothness conditions.
 [73] arXiv:2002.08267 (crosslist from cs.CL) [pdf]

Title: MultilogueNet: A Context Aware RNN for Multimodal Emotion Detection and Sentiment Analysis in ConversationComments: 10 pages, 4 figures, 6 tablesSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Sentiment Analysis and Emotion Detection in conversation are key in a number of real-world applications, with different applications leveraging different kinds of data to achieve reasonably accurate predictions. Multimodal Emotion Detection and Sentiment Analysis can be particularly useful, as applications can use specific subsets of the available modalities, depending on their available data, to produce relevant predictions. Current systems dealing with multimodal functionality fail to leverage and capture the context of the conversation through all modalities, the current speaker and listener(s) in the conversation, and the relevance and relationship between the available modalities through an adequate fusion mechanism. In this paper, we propose a recurrent neural network architecture that attempts to address all of these drawbacks, and keeps track of the context of the conversation, interlocutor states, and the emotions conveyed by the speakers in the conversation. Our proposed model outperforms the state of the art on two benchmark datasets on a variety of accuracy and regression metrics. Our model implementation is public and can be found at github.com/amanshenoy/multiloguenet
 [74] arXiv:2002.08276 (crosslist from stat.ML) [pdf, other]

Title: Partial GromovWasserstein with Applications on PositiveUnlabeled LearningSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The Optimal Transport (OT) framework allows defining similarity between probability distributions and provides metrics such as the Wasserstein and Gromov-Wasserstein discrepancies. The classical OT problem seeks a transportation map that preserves the total mass, requiring the mass of the source and target distributions to be the same. This may be too restrictive in certain applications such as color or shape matching, since the distributions may have arbitrary masses or only a fraction of the total mass may have to be transported. Several algorithms have been devised for computing unbalanced Wasserstein metrics, but when it comes to the Gromov-Wasserstein problem, no partial formulation is available yet. This precludes working with distributions that do not lie in the same metric space, or in settings where invariance to rotation or translation is needed. In this paper, we address the partial Gromov-Wasserstein problem and propose an algorithm to solve it. We showcase the new formulation in a positive-unlabeled (PU) learning application. To the best of our knowledge, this is the first application of optimal transport in this context, and we first highlight that partial Wasserstein-based metrics prove effective in usual PU learning settings. We then demonstrate that partial Gromov-Wasserstein metrics are efficient in scenarios where point clouds come from different domains or have different features.
 [75] arXiv:2002.08277 (crosslist from cs.CV) [pdf, other]

Title: When Radiology Report Generation Meets Knowledge GraphSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Automatic radiology report generation has been an attractive research problem in recent years, aiming at computer-aided diagnosis that alleviates the workload of doctors. Deep learning techniques for natural image captioning have been successfully adapted to generating radiology reports. However, radiology image reporting differs from the natural image captioning task in two aspects: 1) the accuracy of positive disease keyword mentions is critical in radiology image reporting, whereas every word carries equal importance in a natural image caption; 2) the evaluation of reporting quality should focus more on matching the disease keywords and their associated attributes than on counting the occurrence of N-grams. Based on these concerns, we propose to utilize a pre-constructed graph embedding module (modeled with a graph convolutional neural network) on multiple disease findings to assist the generation of reports. The incorporation of a knowledge graph allows for dedicated feature learning for each disease finding and for modeling the relationships between them. In addition, we propose a new evaluation metric for radiology image reporting with the assistance of the same composed graph. Experimental results on a publicly accessible dataset (IU-RR) of chest radiographs demonstrate the superior performance of the methods integrated with the proposed graph embedding module compared with previous approaches, using both the conventional evaluation metrics commonly adopted for image captioning and our proposed ones.
 [76] arXiv:2002.08295 (crosslist from cs.DC) [pdf, other]

Title: MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at ScaleSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Software Engineering (cs.SE); Machine Learning (stat.ML)
Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that researchers are hard-pressed to analyze and study them. The complicated procedures for evaluating innovations, along with the lack of standard and efficient ways of specifying and provisioning ML/DL evaluation, are a major "pain point" for the community. This paper proposes MLModelScope, an open-source, framework/hardware agnostic, extensible and customizable design that enables repeatable, fair, and scalable model evaluation and benchmarking.
We implement the distributed design with support for all major frameworks and hardware, and equip it with web, command-line, and library interfaces. To demonstrate MLModelScope's capabilities we perform parallel evaluation and show how subtle changes to the model evaluation pipeline affect accuracy and how HW/SW stack choices affect performance.
 [77] arXiv:2002.08301 (crosslist from eess.IV) [pdf, ps, other]

Title: Multiwavelet residual dense convolutional neural network for image denoisingComments: 9 pages, 9 figuresSubjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Networks with a large receptive field (RF) have shown advanced fitting ability in recent years. In this work, we utilize the short-term residual learning method to improve the performance and robustness of networks for image denoising tasks. Here, we choose a multi-wavelet convolutional neural network (MWCNN), one of the state-of-the-art networks with a large RF, as the backbone, and insert residual dense blocks (RDBs) in each of its layers. We call this scheme the multi-wavelet residual dense convolutional neural network (MWRDCNN). Compared with other RDB-based networks, it can extract more features of the object from adjacent layers, preserve the large RF, and boost the computing efficiency. Meanwhile, this approach also makes it possible to absorb the advantages of multiple architectures in a single network without conflicts. The performance of the proposed method has been demonstrated in extensive experiments with a comparison with existing techniques.
 [78] arXiv:2002.08313 (crosslist from cs.CR) [pdf, other]

Title: NNoculation: Broad Spectrum and Targeted Treatment of Backdoored DNNsAuthors: Akshaj Kumar Veldanda, Kang Liu, Benjamin Tan, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Brendan DolanGavitt, Siddharth GargSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
This paper proposes a novel two-stage defense (NNoculation) against backdoored neural networks (BadNets) that, unlike existing defenses, makes minimal assumptions on the shape, size and location of backdoor triggers and the BadNet's functioning. In the pre-deployment stage, NNoculation retrains the network using "broad-spectrum" random perturbations of inputs drawn from a clean validation set to partially reduce the adversarial impact of a backdoor. In the post-deployment stage, NNoculation detects and quarantines backdoored test inputs by recording disagreements between the original and pre-deployment patched networks. A CycleGAN is then trained to learn transformations between clean validation inputs and quarantined inputs; i.e., it learns to add triggers to clean validation images. This transformed set of backdoored validation images along with their correct labels is used to further retrain the BadNet, yielding our final defense. NNoculation outperforms the state-of-the-art defenses NeuralCleanse and Artificial Brain Stimulation (ABS), which we show are ineffective when their restrictive assumptions are circumvented by the attacker.
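The post-deployment quarantine step described above can be sketched as a simple disagreement check between the two networks. This is only an illustration under stated assumptions: the function name and the use of predicted class labels as plain integer arrays are hypothetical, and the paper's actual pipeline may record disagreements differently.

```python
import numpy as np

def quarantine_disagreements(original_preds, patched_preds):
    """Return indices of test inputs on which the original (possibly backdoored)
    network and the pre-deployment patched network predict different labels."""
    original_preds = np.asarray(original_preds)
    patched_preds = np.asarray(patched_preds)
    return np.flatnonzero(original_preds != patched_preds)

# Toy usage with made-up predicted labels for five test inputs.
orig = [3, 1, 7, 7, 0]
patched = [3, 1, 2, 7, 5]
flagged = quarantine_disagreements(orig, patched)  # inputs at positions 2 and 4 disagree
```

The flagged inputs would then form the quarantined set used to train the CycleGAN.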
 [79] arXiv:2002.08314 (crosslist from stat.ML) [pdf, other]

Title: NonAligned Distribution Distance using Metric Measure Embedding and Optimal TransportSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We propose a novel approach for comparing distributions whose supports do not necessarily lie on the same metric space. Unlike the Gromov-Wasserstein (GW) distance, which compares pairwise distances of elements from each distribution, we consider a method that embeds the metric measure spaces in a common Euclidean space and computes an optimal transport (OT) on the embedded distributions. This leads to what we call a sub-embedding robust Wasserstein (SERW). Under some conditions, SERW is a distance that considers an OT distance of the (low-distorted) embedded distributions using a common metric. In addition to this novel proposal that generalizes several recent OT works, our contributions rest on several theoretical analyses: i) we characterize the embedding spaces to define the SERW distance for distribution alignment; ii) we prove that SERW has almost the same properties as the GW distance, and we give a cost relation between GW and SERW. The paper also provides some numerical experiments illustrating how SERW behaves on real-world matching problems.
 [80] arXiv:2002.08320 (crosslist from cs.CR) [html]

Title: Proceedings of the Artificial Intelligence for Cyber Security (AICS) Workshop 2020Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); HumanComputer Interaction (cs.HC); Machine Learning (cs.LG)
The workshop will focus on the application of artificial intelligence to problems in cyber security. The emphasis of AICS 2020 will be on human-machine teaming within the context of cyber security problems, and the workshop will specifically explore collaboration between human operators and AI technologies. The workshop will address applicable areas of AI, such as machine learning, game theory, natural language processing, knowledge representation, automated and assistive reasoning, and human-machine interaction. Further, cyber security application areas with a particular emphasis on the characterization and deployment of human-machine teaming will be the focus.
 [81] arXiv:2002.08326 (crosslist from cs.DC) [pdf, other]

Title: Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPUSystolic Array IntegrationAuthors: Cong Guo, Yangjie Zhou, Jingwen Leng, Yuhao Zhu, Zidong Du, Quan Chen, Chao Li, Minyi Guo, Bin YaoComments: Accepted by DAC2020Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
The research interest in specialized hardware accelerators for deep neural networks (DNN) spiked recently owing to their superior performance and efficiency. However, today's DNN accelerators primarily focus on accelerating specific "kernels" such as convolution and matrix multiplication, which are vital but only part of an end-to-end DNN-enabled application. Meaningful speedups over the entire application often require supporting computations that are, while massively parallel, ill-suited to DNN accelerators. Integrating a general-purpose processor such as a CPU or a GPU incurs significant data movement overhead and leads to resource under-utilization on the DNN accelerators.
We propose Simultaneous Multi-mode Architecture (SMA), a novel architecture design and execution model that offers general-purpose programmability on DNN accelerators in order to accelerate end-to-end applications. The key to SMA is the temporal integration of the systolic execution model with the GPU-like SIMD execution model. The SMA exploits the common components shared between the systolic-array accelerator and the GPU, and provides lightweight reconfiguration capability to switch between the two modes in-situ. The SMA achieves up to 63% performance improvement while consuming 23% less energy than the baseline Volta architecture with TensorCore.
 [82] arXiv:2002.08327 (crosslist from cs.CR) [pdf, ps, other]

Title: Fawkes: Protecting Personal Privacy against Unauthorized Deep Learning ModelsSubjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Today's proliferation of powerful facial recognition models poses a real threat to personal privacy. As Clearview.ai demonstrated, anyone can canvass the Internet for data and train highly accurate facial recognition models of us without our knowledge. We need tools to protect ourselves from unauthorized facial recognition systems and their numerous potential misuses. Unfortunately, work in related areas is limited in practicality and effectiveness. In this paper, we propose Fawkes, a system that allows individuals to inoculate themselves against unauthorized facial recognition models. Fawkes achieves this by helping users add imperceptible pixel-level changes (we call them "cloaks") to their own photos before publishing them online. When collected by a third-party "tracker" and used to train facial recognition models, these "cloaked" images produce functional models that consistently misidentify the user. We experimentally show that Fawkes provides 95+% protection against user recognition regardless of how trackers train their models. Even when clean, uncloaked images are "leaked" to the tracker and used for training, Fawkes can still maintain an 80+% protection success rate. In fact, we perform real experiments against today's state-of-the-art facial recognition services and achieve 100% success. Finally, we show that Fawkes is robust against a variety of countermeasures that try to detect or disrupt cloaks.
 [83] arXiv:2002.08333 (crosslist from cs.RO) [pdf, other]

Title: Towards Intelligent Pick and Place Assembly of Individualized Products Using Reinforcement LearningSubjects: Robotics (cs.RO); Machine Learning (cs.LG)
Individualized manufacturing is becoming an important approach as a means to fulfill increasingly diverse and specific consumer requirements and expectations. While there are various solutions to the implementation of the manufacturing process, such as additive manufacturing, the subsequent automated assembly remains a challenging task. As an approach to this problem, we aim to teach a collaborative robot to successfully perform pick and place tasks by implementing reinforcement learning. For the assembly of an individualized product in a constantly changing manufacturing environment, the simulated geometric and dynamic parameters will be varied. Using reinforcement learning algorithms capable of meta-learning, the tasks will first be trained in simulation. They will then be performed in a real-world environment where new factors are introduced that were not simulated in training to confirm the robustness of the algorithms. The robot will gain its input data from tactile sensors, area scan cameras, and 3D cameras used to generate heightmaps of the environment and the objects. The selection of machine learning algorithms and hardware components as well as further research questions to realize the outlined production scenario are the results of the presented work.
 [84] arXiv:2002.08335 (crosslist from stat.ML) [pdf, other]

Title: Deep regularization and direct training of the inner layers of Neural Networks with Kernel FlowsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We introduce a new regularization method for Artificial Neural Networks (ANNs) based on Kernel Flows (KFs). KFs were introduced as a method for kernel selection in regression/kriging based on the minimization of the loss of accuracy incurred by halving the number of interpolation points in random batches of the dataset. Writing $f_\theta(x) = \big(f^{(n)}_{\theta_n}\circ f^{(n-1)}_{\theta_{n-1}} \circ \dots \circ f^{(1)}_{\theta_1}\big)(x)$ for the functional representation of compositional structure of the ANN, the inner layers outputs $h^{(i)}(x) = \big(f^{(i)}_{\theta_i}\circ f^{(i-1)}_{\theta_{i-1}} \circ \dots \circ f^{(1)}_{\theta_1}\big)(x)$ define a hierarchy of feature maps and kernels $k^{(i)}(x,x')=\exp(-\gamma_i \|h^{(i)}(x)-h^{(i)}(x')\|_2^2)$. When combined with a batch of the dataset these kernels produce KF losses $e_2^{(i)}$ (the $L^2$ regression error incurred by using a random half of the batch to predict the other half) depending on parameters of inner layers $\theta_1,\ldots,\theta_i$ (and $\gamma_i$). The proposed method simply consists in aggregating a subset of these KF losses with a classical output loss. We test the proposed method on CNNs and WRNs without alteration of structure nor output classifier and report reduced test errors, decreased generalization gaps, and increased robustness to distribution shift without significant increase in computational complexity. We suspect that these results might be explained by the fact that while conventional training only employs a linear functional (a generalized moment) of the empirical distribution defined by the dataset and can be prone to trapping in the Neural Tangent Kernel regime (under over-parameterizations), the proposed loss function (defined as a nonlinear functional of the empirical distribution) effectively trains the underlying kernel defined by the CNN beyond regressing the data with that kernel.
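To make the KF loss concrete, here is a minimal NumPy sketch of an $L^2$ error obtained by using a random half of a batch to predict the other half with a Gaussian kernel $k(x,x')=\exp(-\gamma\|h(x)-h(x')\|_2^2)$ on layer features $h$. It is an assumption-laden illustration, not the authors' implementation: the function names, the `nugget` regulariser, and the mean-squared normalisation are hypothetical, and a real training loop would compute this in an autodiff framework so that gradients reach the inner-layer parameters.

```python
import numpy as np

def gaussian_kernel(a, b, gamma):
    """k(x, x') = exp(-gamma * ||x - x'||_2^2) between the rows of a and b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clamp tiny negative round-off

def kf_l2_loss(features, targets, gamma, rng, nugget=1e-6):
    """Mean squared error when a random half of the batch predicts the other half.

    `nugget` is a small diagonal regulariser added for numerical stability
    (an assumption of this sketch).
    """
    n = features.shape[0]
    perm = rng.permutation(n)
    half = n // 2
    fit, held = perm[:half], perm[half:]
    k_ff = gaussian_kernel(features[fit], features[fit], gamma)
    k_hf = gaussian_kernel(features[held], features[fit], gamma)
    coef = np.linalg.solve(k_ff + nugget * np.eye(half), targets[fit])
    pred = k_hf @ coef
    return float(np.mean((pred - targets[held]) ** 2))

# Toy usage: random "inner-layer features" and a smooth target.
rng = np.random.default_rng(0)
h = rng.normal(size=(32, 4))
y = np.sin(h[:, :1])
loss = kf_l2_loss(h, y, gamma=0.5, rng=rng)
```

In the setting of the abstract, one such loss per selected inner layer would be added to the usual output loss.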
Replacements for Thu, 20 Feb 20
 [85] arXiv:1804.07090 (replaced) [pdf, other]

Title: Robustness via Deep LowRank RepresentationsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [86] arXiv:1808.10648 (replaced) [pdf, other]

Title: Adaptation and Robust Learning of Probabilistic Movement PrimitivesSubjects: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
 [87] arXiv:1905.11027 (replaced) [pdf, other]

Title: Lightlike Neuromanifolds, Occam's Razor and Deep LearningComments: Under review in ICML 2020Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [88] arXiv:1905.11926 (replaced) [pdf, other]

Title: Network DeconvolutionAuthors: Chengxi Ye, Matthew Evanusa, Hua He, Anton Mitrokhin, Thomas Goldstein, James A. Yorke, Cornelia Fermüller, Yiannis AloimonosComments: ICLR 2020Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
 [89] arXiv:1905.12265 (replaced) [pdf, other]

Title: Strategies for Pretraining Graph Neural NetworksAuthors: Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, Jure LeskovecComments: Accepted as a spotlight to ICLR 2020Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [90] arXiv:1905.12726 (replaced) [pdf, other]

Title: Prioritized Sequence Experience ReplayComments: 18 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [91] arXiv:1906.05374 (replaced) [pdf, other]

Title: MetaLearning via Learned LossAuthors: Sarah Bechtle, Artem Molchanov, Yevgen Chebotar, Edward Grefenstette, Ludovic Righetti, Gaurav Sukhatme, Franziska MeierSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
 [92] arXiv:1908.05569 (replaced) [pdf, other]

Title: Isotropic Maximization Loss and Entropic Score: Fast, Accurate, Scalable, Unexposed, Turnkey, and Native Neural Networks OutofDistribution DetectionAuthors: David Macêdo, Tsang Ing Ren, Cleber Zanchettin, Adriano L. I. Oliveira, Alain Tapp, Teresa LudermirSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [93] arXiv:1908.06869 (replaced) [pdf, other]

Title: XSP: AcrossStack Profiling and Analysis of Machine Learning Models on GPUsSubjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Performance (cs.PF); Machine Learning (stat.ML)
 [94] arXiv:1909.04823 (replaced) [pdf, other]

Title: Distributed Equivalent Substitution Training for LargeScale Recommender SystemsAuthors: Haidong Rong, Yangzihao Wang, Feihu Zhou, Junjie Zhai, Haiyang Wu, Rui Lan, Fan Li, Han Zhang, Yuekui Yang, Zhenyu Guo, Di WangComments: 10 pagesSubjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
 [95] arXiv:1909.11957 (replaced) [pdf, other]

Title: Drawing earlybird tickets: Towards more efficient training of deep networksAuthors: Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, Yingyan LinComments: Accepted as ICLR2020 SpotlightSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [96] arXiv:1910.09191 (replaced) [pdf, other]

Title: Regularization Matters in Policy OptimizationComments: More analytic experiments and evaluation metrics added on last version. Code link: this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [97] arXiv:1910.10196 (replaced) [pdf, other]

Title: Noregret Nonconvex Online MetaLearningSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [98] arXiv:1910.11858 (replaced) [pdf, other]

Title: BANANAS: Bayesian Optimization with Neural Architectures for Neural Architecture SearchSubjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
 [99] arXiv:1910.12027 (replaced) [pdf, other]

Title: Consistency Regularization for Generative Adversarial NetworksComments: ICLR2020Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [100] arXiv:1910.13406 (replaced) [pdf, other]

Title: Generalization of Reinforcement Learners with Working and Episodic MemoryAuthors: Meire Fortunato, Melissa Tan, Ryan Faulkner, Steven Hansen, Adrià Puigdomènech Badia, Gavin Buttimore, Charlie Deck, Joel Z Leibo, Charles BlundellComments: NeurIPS 2019. Equal contribution of first 4 authorsJournalref: 33rd Conference on Neural Information Processing Systems (Neurips 2019)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [101] arXiv:1911.05076 (replaced) [pdf, other]

Title: Constant Curvature Graph Convolutional NetworksSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
 [102] arXiv:1911.06922 (replaced) [pdf, other]

Title: Benanza: Automatic $μ$Benchmark Generation to Compute "Lowerbound" Latency and Inform Optimizations of Deep Learning Models on GPUsSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Machine Learning (stat.ML)
 [103] arXiv:1911.09032 (replaced) [pdf, other]

Title: Outside the Box: AbstractionBased Monitoring of Neural NetworksComments: accepted at ECAI 2020Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)
 [104] arXiv:1912.05833 (replaced) [pdf, other]

Title: Speechdriven facial animation using polynomial fusion of featuresAuthors: Triantafyllos Kefalas, Konstantinos Vougioukas, Yannis Panagakis, Stavros Petridis, Jean Kossaifi, Maja PanticSubjects: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
 [105] arXiv:1912.06638 (replaced) [pdf, other]

Title: WaLDORf: Wasteless Languagemodel Distillation On ReadingcomprehensionComments: Added Figure, minor edits for claritySubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
 [106] arXiv:1912.09818 (replaced) [pdf, other]

Title: When Explanations Lie: Why Many Modified BP Attributions FailComments: 18 pages, 10 figures. PreprintSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [107] arXiv:1912.09855 (replaced) [pdf, other]

Title: Explainability and Adversarial Robustness for RNNsComments: Accepted at IEEE BigDataService 2020Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI); Machine Learning (stat.ML)
 [108] arXiv:2001.00012 (replaced) [pdf]

Title: Differentially Private Mband WaveletBased Mechanisms in Machine Learning EnvironmentsComments: PartTime Research Assistant/Helper: Tony Lee; 49 pages, 20 figures, 1 table, to be published by International Press of BostonSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
 [109] arXiv:2001.01796 (replaced) [pdf, other]

Title: Fair Active LearningSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [110] arXiv:2001.02407 (replaced) [pdf, other]

Title: SPACE: Unsupervised ObjectOriented Scene Representation via Spatial Attention and DecompositionAuthors: Zhixuan Lin, YiFu Wu, Skand Vishwanath Peri, Weihao Sun, Gautam Singh, Fei Deng, Jindong Jiang, Sungjin AhnComments: Accepted in ICLR 2020Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
 [111] arXiv:2001.08456 (replaced) [pdf, other]

Title: AdaLISTA: Learned Solvers Adaptive to Varying ModelsSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [112] arXiv:2002.02561 (replaced) [pdf, other]

Title: Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural NetworksSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [113] arXiv:2002.02705 (replaced) [pdf, other]

Title: Trust Your Model: Iterative Label Improvement and Robust Training by Confidence Based Filtering and Dataset PartitioningSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [114] arXiv:2002.02950 (replaced) [pdf, ps, other]

Title: Logistic Regression Regret: What's the Catch?Authors: Gil I. ShamirSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [115] arXiv:2002.03461 (replaced) [pdf, other]

Title: Relation Embedding for Personalised POI RecommendationComments: 12 pages, 3 figures, Accepted in the 24th PacificAsia Conference on Knowledge Discovery and Data Mining (PAKDD 2020)Subjects: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (stat.ML)
 [116] arXiv:2002.03847 (replaced) [pdf, other]

Title: Making Logic Learnable With Neural NetworksAuthors: Tobias Brudermueller, Dennis L. Shung, Loren Laine, Adrian J. Stanley, Stig B. Laursen, Harry R. Dalton, Jeffrey Ngu, Michael Schultz, Johannes Stegmaier, Smita KrishnaswamySubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)
 [117] arXiv:2002.05227 (replaced) [pdf, other]

Title: Variational Autoencoders with Riemannian Brownian Motion PriorsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [118] arXiv:2002.05706 (replaced) [pdf, other]

Title: Sequential Cooperative Bayesian InferenceComments: 28 pages, 22 figures, submitted to ICML 2020Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
 [119] arXiv:2002.07684 (replaced) [pdf, ps, other]

Title: A Lagrangian Approach to Information Propagation in Graph Neural NetworksSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [120] arXiv:2002.07766 (replaced) [pdf, other]

Title: Learning Bijective Feature Maps for Linear ICAComments: 8 pagesSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [121] arXiv:1811.03862 (replaced) [pdf, other]

Title: Targeting Solutions in Bayesian MultiObjective Optimization: Sequential and Batch VersionsJournalref: Annals of Mathematics and Artificial Intelligence volume 88, pages 187-212 (2020)Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
 [122] arXiv:1811.06026 (replaced) [pdf, ps, other]

Title: Incentivizing Exploration with Selective Data DisclosureSubjects: Computer Science and Game Theory (cs.GT); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
 [123] arXiv:1812.03894 (replaced) [pdf, other]

Title: ModelBased Learning of Turbulent Flows using a Mobile RobotComments: 21 pages, 26 figuresSubjects: Robotics (cs.RO); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [124] arXiv:1812.04103 (replaced) [pdf, other]

Title: Nonlocal UNet for Biomedical Image SegmentationComments: In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), 2019Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
 [125] arXiv:1812.09747 (replaced) [pdf, other]

Title: Let Me Not Lie: Learning MultiNomial LogitComments: 33 pages, 12 tables, 6 figures, +10 p. AppendixSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [126] arXiv:1901.00409 (replaced) [pdf, other]

Title: Neural Clustering ProcessesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [127] arXiv:1901.10787 (replaced) [pdf, other]

Title: Tensorized Embedding Layers for Efficient Model CompressionSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
 [128] arXiv:1902.04495 (replaced) [pdf, other]

Title: The Cost of Privacy: Optimal Rates of Convergence for Parameter Estimation with Differential PrivacyComments: 36 pages, 4 figuresSubjects: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
 [129] arXiv:1904.06744 (replaced) [pdf, ps, other]

Title: A Personalized Preference Learning Framework for Caching in Mobile NetworksComments: 21 pages, 10 figures, 1 table, to appear in the IEEE Transactions on Mobile ComputingSubjects: Networking and Internet Architecture (cs.NI); Information Retrieval (cs.IR); Information Theory (cs.IT); Machine Learning (cs.LG); Multimedia (cs.MM)
 [130] arXiv:1905.00919 (replaced) [pdf, other]

Title: Mimic Learning to Generate a Shareable Network Intrusion Detection ModelAuthors: Ahmed Shafee, Mohamed Baza, Douglas A. Talbert, Mostafa M. Fouda, Mahmoud Nabil, Mohamed MahmoudSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [131] arXiv:1905.04753 (replaced) [pdf, other]

Title: Budgeted Training: Rethinking Deep Neural Network Training Under Resource ConstraintsComments: ICLR 2020. Project page with code is at this http URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
 [132] arXiv:1905.09952 (replaced) [pdf, other]

Title: Fast Algorithms for Computational Optimal Transport and Wasserstein BarycenterComments: 18 pages, 35 figuresSubjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [133] arXiv:1906.06627 (replaced) [pdf, other]

Title: Representation Quality Explains Adversarial AttacksSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
 [134] arXiv:1906.09412 (replaced) [pdf, other]

Title: Multitask Learning for Aggregated Data using Gaussian ProcessesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [135] arXiv:1906.11667 (replaced) [pdf, other]

Title: Evolving Robust Neural Architectures to Defend from Adversarial AttacksSubjects: Neural and Evolutionary Computing (cs.NE); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
 [136] arXiv:1907.12160 (replaced) [pdf, ps, other]

Title: Adaptive spline fitting with particle swarm optimization
Comments: Expanded literature survey; performance comparison with WaveShrink and smoothing spline; new figures and a table added
Subjects: Computation (stat.CO); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Methodology (stat.ME)
 [137] arXiv:1909.11764 (replaced) [pdf, ps, other]

Title: FreeLB: Enhanced Adversarial Training for Natural Language Understanding
Comments: ICLR 2020
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
 [138] arXiv:1910.01226 (replaced) [pdf, ps, other]

Title: Piracy Resistant Watermarks for Deep Neural Networks
Comments: 13 pages
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [139] arXiv:1910.02757 (replaced) [pdf, other]

Title: Stochastic Bandits with Delay-Dependent Payoffs
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [140] arXiv:1910.04462 (replaced) [pdf, other]

Title: Fast Tree Variants of Gromov-Wasserstein
Comments: A major revision: (1) improve the complexity of the efficient computation of FlowTGW, (2) add more discussions on tree-metric sampling by clustering-based, (3) add more experiments on larger datasets, and investigate empirical relation for variants of GW, (4) add more details, complexity analysis, and discussions for FlowTGW and DepthTGW, and (5) add some reviews and further discussions
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [141] arXiv:1910.05505 (replaced) [pdf, other]

Title: Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers
Comments: 26 pages, proof of the fact that the flow always converges to a critical point (Theorem 10) significantly simplified, numerical section updated
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
 [142] arXiv:1910.09126 (replaced) [pdf, other]

Title: Communication-Efficient Local Decentralized SGD Methods
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
 [143] arXiv:1910.14375 (replaced) [pdf, other]

Title: A comparative study of estimating articulatory movements from phoneme sequences and acoustic features
Comments: 5 pages, 5 figures, accepted in ICASSP 2020
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
 [144] arXiv:1911.05146 (replaced) [pdf, other]

Title: HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow
Comments: 18 pages, 10 figures, Accepted, to be presented at ISC '20
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
 [145] arXiv:1911.07509 (replaced) [pdf, other]

Title: AI-based Pilgrim Detection using Convolutional Neural Networks
Comments: Accepted in ATSIP'2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
 [146] arXiv:1911.07676 (replaced) [pdf, ps, other]

Title: Learning with Good Feature Representations in Bandits and in RL with a Generative Model
Comments: 13 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [147] arXiv:1912.02906 (replaced) [pdf, other]

Title: Scalable Reinforcement Learning of Localized Policies for Multi-Agent Networked Systems
Comments: Added experimental results
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
 [148] arXiv:1912.06366 (replaced) [pdf, ps, other]

Title: Provably Efficient Reinforcement Learning with Aggregated States
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
 [149] arXiv:2001.02004 (replaced) [pdf, other]

Title: CNN 101: Interactive Visual Learning for Convolutional Neural Networks
Authors: Zijie J. Wang, Robert Turko, Omar Shaikh, Haekyu Park, Nilaksh Das, Fred Hohman, Minsuk Kahng, Duen Horng Chau
Comments: CHI'20 Late-Breaking Work (April 25-30, 2020), 7 pages, 3 figures
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
 [150] arXiv:2001.11897 (replaced) [pdf, other]

Title: Learning Unitaries by Gradient Descent
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Mathematical Physics (math-ph)
 [151] arXiv:2002.02534 (replaced) [pdf, other]

Title: Fast inference of Boosted Decision Trees in FPGAs for particle physics
Authors: Sioni Summers, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Duc Hoang, Sergo Jindariani, Edward Kreinar, Vladimir Loncar, Jennifer Ngadiuba, Maurizio Pierini, Dylan Rankin, Nhan Tran, Zhenbin Wu
Subjects: Computational Physics (physics.comp-ph); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
 [152] arXiv:2002.04700 (replaced) [pdf]

Title: A Single RGB Camera Based Gait Analysis with a Mobile Tele-Robot for Healthcare
Authors: Ziyang Wang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
 [153] arXiv:2002.04836 (replaced) [src]

Title: Analysis Of Multi Field Of View Cnn And Attention Cnn On H&E Stained Whole-slide Images On Hepatocellular Carcinoma
Comments: This paper has been withdrawn by the authors due to need for heavy revise
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
 [154] arXiv:2002.05145 (replaced) [pdf, other]

Title: Weighted Empirical Risk Minimization: Sample Selection Bias Correction based on Importance Sampling
Comments: 20 pages, 7 tables and figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [155] arXiv:2002.06177 (replaced) [pdf]

Title: The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence
Authors: Gary Marcus
Comments: 5 figures
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
 [156] arXiv:2002.06707 (replaced) [pdf, other]

Title: Stochastic Normalizing Flows
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Data Analysis, Statistics and Probability (physics.data-an)
 [157] arXiv:2002.07215 (replaced) [pdf, other]

Title: STANNIS: Low-Power Acceleration of Deep Neural Network Training Using Computational Storage
Authors: Ali HeydariGorji, Mahdi Torabzadehkashi, Siavash Rezaei, Hossein Bobarshad, Vladimir Alves, Pai H. Chou
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)