New submissions

New submissions for Fri, 15 Nov 19

[1]  arXiv:1911.05754 [pdf, other]
Title: Implicit Hamiltonian Monte Carlo for Sampling Multiscale Distributions
Subjects: Computation (stat.CO); Methodology (stat.ME)

Hamiltonian Monte Carlo (HMC) has been widely adopted in the statistics community because of its ability to sample high-dimensional distributions much more efficiently than other Metropolis-based methods. Despite this, HMC often performs sub-optimally on distributions with high correlations or marginal variances on multiple scales because the resulting stiffness forces the leapfrog integrator in HMC to take an unreasonably small stepsize. We provide intuition as well as a formal analysis showing how these multiscale distributions limit the stepsize of leapfrog and we show how the implicit midpoint method can be used, together with Newton-Krylov iteration, to circumvent this limitation and achieve major efficiency gains. Furthermore, we offer practical guidelines for when to choose between implicit midpoint and leapfrog and what stepsize to use for each method, depending on the distribution being sampled. Unlike previous modifications to HMC, our method is generally applicable to highly non-Gaussian distributions exhibiting multiple scales. We illustrate how our method can provide a dramatic speedup over leapfrog in the context of the No-U-Turn sampler (NUTS) applied to several examples.
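The stepsize limitation described in this abstract can be illustrated on the simplest stiff target, a one-dimensional Gaussian with a small standard deviation. The sketch below is not the paper's method: it assumes a quadratic Hamiltonian, for which the implicit midpoint equations can be solved in closed form, so no Newton-Krylov iteration is needed. The function names and the values of `sigma`, `h`, and `n_steps` are illustrative assumptions. For this Hamiltonian, leapfrog is stable only when `h / sigma < 2`, whereas implicit midpoint conserves the energy for any stepsize.

```python
def leapfrog(q, p, h, sigma, n_steps):
    """Leapfrog steps for H(q, p) = p**2 / 2 + q**2 / (2 * sigma**2)."""
    for _ in range(n_steps):
        p -= 0.5 * h * q / sigma**2   # half step in momentum
        q += h * p                    # full step in position
        p -= 0.5 * h * q / sigma**2   # half step in momentum
    return q, p


def implicit_midpoint(q, p, h, sigma, n_steps):
    """Implicit midpoint steps for the same H, solved in closed form.

    The linear force lets us solve the midpoint equations exactly;
    a non-Gaussian target would need an iterative solver instead.
    """
    w2 = 1.0 / sigma**2
    a = 0.25 * h * h * w2
    for _ in range(n_steps):
        q_new = ((1.0 - a) * q + h * p) / (1.0 + a)
        p = p - 0.5 * h * w2 * (q + q_new)
        q = q_new
    return q, p


def energy(q, p, sigma):
    return 0.5 * p**2 + 0.5 * q**2 / sigma**2


# Hypothetical stiff setting: h / sigma = 10, well beyond the leapfrog limit of 2.
sigma, h, n_steps = 0.01, 0.1, 20
q0, p0 = 0.01, 1.0
h0 = energy(q0, p0, sigma)
lf_energy = energy(*leapfrog(q0, p0, h, sigma, n_steps), sigma)
im_energy = energy(*implicit_midpoint(q0, p0, h, sigma, n_steps), sigma)
print("initial:", h0, "leapfrog:", lf_energy, "implicit midpoint:", im_energy)
```

In an HMC sampler the energy error drives the acceptance probability, so the leapfrog blow-up in this regime would mean near-certain rejection unless the stepsize were reduced to the smallest scale of the target.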

[2]  arXiv:1911.05770 [pdf, other]
Title: Constrained Bayesian ICA for Brain Connectome Inference
Subjects: Applications (stat.AP); Neurons and Cognition (q-bio.NC)

Brain connectomics is a developing field in neurosciences which strives to understand cognitive processes and psychiatric diseases through the analysis of interactions between brain regions. However, in the high-dimensional, low-sample, and noisy regimes that typically characterize fMRI data, the recovery of such interactions remains an ongoing challenge: how can we discover patterns of co-activity between brain regions that could then be associated with cognitive processes or psychiatric disorders? In this paper, we investigate a constrained Bayesian ICA approach which, in comparison to current methods, simultaneously allows (a) the flexible integration of multiple sources of information (fMRI, DWI, anatomical, etc.), (b) an automatic and parameter-free selection of the appropriate sparsity level and number of connected submodules, and (c) the provision of estimates of the uncertainty of the recovered interactions. Our experiments, on both synthetic and real-life data, validate the flexibility of our method and highlight the benefits of integrating anatomical information for connectome inference.

[3]  arXiv:1911.05822 [pdf, ps, other]
Title: A Model of Double Descent for High-dimensional Binary Linear Classification
Comments: Short version submitted to ICASSP 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)

We consider a model for logistic regression where only a subset of features of size $p$ is used for training a linear classifier over $n$ training samples. The classifier is obtained by running gradient-descent (GD) on the logistic-loss. For this model, we investigate the dependence of the generalization error on the overparameterization ratio $\kappa=p/n$. First, building on known deterministic results on convergence properties of the GD, we uncover a phase-transition phenomenon for the case of Gaussian regressors: the generalization error of GD is the same as that of the maximum-likelihood (ML) solution when $\kappa<\kappa_\star$, and that of the max-margin (SVM) solution when $\kappa>\kappa_\star$. Next, using the convex Gaussian min-max theorem (CGMT), we sharply characterize the performance of both the ML and SVM solutions. Combining these results, we obtain curves that explicitly characterize the generalization error of GD for varying values of $\kappa$. The numerical results validate the theoretical predictions and unveil double-descent phenomena that complement similar recent observations in linear regression settings.

[4]  arXiv:1911.05865 [pdf, other]
Title: Kriging: Beyond Matérn
Subjects: Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)

The Matérn covariance function is a popular choice for prediction in spatial statistics and uncertainty quantification literature. A key benefit of the Matérn class is that it is possible to get precise control over the degree of differentiability of the process realizations. However, the Matérn class possesses exponentially decaying tails, and thus may not be suitable for modeling long range dependence. This problem can be remedied using polynomial covariances; however one loses control over the degree of differentiability of the process realizations, in that the realizations using polynomial covariances are either infinitely differentiable or not differentiable at all. We construct a new family of covariance functions using a scale mixture representation of the Matérn class where one obtains the benefits of both Matérn and polynomial covariances. The resultant covariance contains two parameters: one controls the degree of differentiability near the origin and the other controls the tail heaviness, independently of each other. Using a spectral representation, we derive theoretical properties of this new covariance including equivalence measures and asymptotic behavior of the maximum likelihood estimators under infill asymptotics. The improved theoretical properties in predictive performance of this new covariance class are verified via extensive simulations. Application using NASA's Orbiting Carbon Observatory-2 satellite data confirms the advantage of this new covariance class over the Matérn class, especially in extrapolative settings.

[5]  arXiv:1911.05881 [pdf, other]
Title: Projecting Flood-Inducing Precipitation with a Bayesian Analogue Model
Subjects: Applications (stat.AP)

The hazard of pluvial flooding is largely influenced by the spatial and temporal dependence characteristics of precipitation. When extreme precipitation possesses strong spatial dependence, the risk of flooding is amplified due to catchment factors that cause runoff accumulation such as topography. Temporal dependence can also increase flood risk as storm water drainage systems operating at capacity can be overwhelmed by heavy precipitation occurring over multiple days. While transformed Gaussian processes are common choices for modeling precipitation, their weak tail dependence may lead to underestimation of flood risk. Extreme value models such as the generalized Pareto processes for threshold exceedances and max-stable models are attractive alternatives, but are difficult to fit when the number of observation sites is large, and are of little use for modeling the bulk of the distribution, which may also be of interest to water management planners. While the atmospheric dynamics governing precipitation are complex and difficult to fully incorporate into a parsimonious statistical model, non-mechanistic analogue methods that approximate those dynamics have proven to be promising approaches to capturing the temporal dependence of precipitation. In this paper, we present a Bayesian analogue method that leverages large, synoptic-scale atmospheric patterns to make precipitation forecasts. Changing spatial dependence across varying intensities is modeled as a mixture of spatial Student-t processes that can accommodate both strong and weak tail dependence. The proposed model demonstrates improved performance at capturing the distribution of extreme precipitation over Community Atmosphere Model (CAM) 5.2 forecasts.

[6]  arXiv:1911.05934 [pdf, other]
Title: Bayesian Optimization with Uncertain Preferences over Attributes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

We consider black-box global optimization of time-consuming-to-evaluate functions on behalf of a decision-maker whose preferences must be learned. Each feasible design is associated with a time-consuming-to-evaluate vector of attributes, each vector of attributes is assigned a utility by the decision-maker's utility function, and this utility function may be learned approximately using preferences expressed by the decision-maker over pairs of attribute vectors. Past work has used this estimated utility function as if it were error-free within single-objective optimization. However, errors in utility estimation may yield a poor suggested decision. Furthermore, this approach produces a single suggested "best" design, whereas decision-makers often prefer to choose among a menu of designs. We propose a novel Bayesian optimization algorithm that acknowledges the uncertainty in preference estimation and implicitly chooses designs to evaluate using the time-consuming function that are good not just for a single estimated utility function but a range of likely utility functions. Our algorithm then shows a menu of designs and evaluated attributes to the decision-maker who makes a final selection. We demonstrate the value of our algorithm in a variety of numerical experiments.

[7]  arXiv:1911.05940 [pdf, other]
Title: Distributional Clustering: A distribution-preserving clustering method
Comments: Submitted to Statistica Sinica
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

One key use of k-means clustering is to identify cluster prototypes which can serve as representative points for a dataset. However, a drawback of using k-means cluster centers as representative points is that such points distort the distribution of the underlying data. This can be highly disadvantageous in problems where the representative points are subsequently used to gain insights on the data distribution, as these points do not mimic the distribution of the data. To this end, we propose a new clustering method called "distributional clustering", which ensures cluster centers capture the distribution of the underlying data. We first prove the asymptotic convergence of the proposed cluster centers to the data generating distribution, then present an efficient algorithm for computing these cluster centers in practice. Finally, we demonstrate the effectiveness of distributional clustering on synthetic and real datasets.

[8]  arXiv:1911.05970 [pdf, other]
Title: Empirical Bayes mean estimation with nonparametric errors via order statistic regression
Subjects: Methodology (stat.ME)

We study empirical Bayes estimation of the effect sizes of $N$ units from $K$ noisy observations on each unit. We show that it is possible to achieve near-Bayes optimal mean squared error, without any assumptions or knowledge about the effect size distribution or the noise. The noise distribution can be heteroskedastic and vary arbitrarily from unit to unit. Our proposal, which we call Aurora, leverages the replication inherent in the $K$ observations per unit and recasts the effect size estimation problem as a general regression problem. Aurora with linear regression provably matches the performance of a wide array of estimators including the sample mean, the trimmed mean, the sample median, as well as James-Stein shrunk versions thereof. Aurora automates effect size estimation for Internet-scale datasets, as we demonstrate on Google data.
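One plausible reading of "order statistic regression" with linear regression, based only on the title and abstract, is: hold out one replicate per unit, regress it on the sorted remaining replicates, and average the fitted values over the choice of held-out replicate. The sketch below follows that reading. It is an assumption, not the authors' reference implementation, and the function name `aurora_linear` and the simulated Gaussian data are hypothetical.

```python
import numpy as np


def aurora_linear(Y):
    """Sketch of held-out regression on order statistics.

    Y has shape (N, K): K noisy replicates for each of N units.
    Returns one effect-size estimate per unit.
    """
    N, K = Y.shape
    preds = np.zeros((N, K))
    for j in range(K):
        target = Y[:, j]                                    # held-out replicate
        others = np.sort(np.delete(Y, j, axis=1), axis=1)   # order statistics of the rest
        X = np.column_stack([np.ones(N), others])           # intercept + K-1 order statistics
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)   # ordinary least squares
        preds[:, j] = X @ coef
    return preds.mean(axis=1)                               # average over held-out choices


# Simulated data (assumed setup): standard normal effects, standard normal noise.
rng = np.random.default_rng(0)
N, K = 5000, 5
mu = rng.normal(size=N)
Y = mu[:, None] + rng.normal(size=(N, K))

est = aurora_linear(Y)
mse_aurora = np.mean((est - mu) ** 2)
mse_mean = np.mean((Y.mean(axis=1) - mu) ** 2)
print("order-statistic regression MSE:", mse_aurora, "sample mean MSE:", mse_mean)
```

In this Gaussian setup the regression learns to shrink toward the overall mean, which is why it can improve on the per-unit sample mean without being told the prior or the noise level.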

[9]  arXiv:1911.06006 [pdf, other]
Title: An Invariant Test for Equality of Two Large Scale Covariance Matrices
Subjects: Statistics Theory (math.ST)

Motivated by the recent work of Zhang et al. (2019), we study a new invariant test for equality of two large scale covariance matrices. The two modified likelihood ratio tests (LRTs) of Zhang et al. (2019) are based on the sum of the logarithms of the eigenvalues (or of one minus the eigenvalues) of the Beta-matrix. However, as the dimension increases, many eigenvalues of the Beta-matrix are close to 0 or 1 and the modified LRTs are greatly influenced by them. Instead, we consider the simple sum of the eigenvalues of the Beta-matrix and establish its asymptotic normality when $n_1$, $n_2$, and $p$ all increase at the same rate. We show numerically that our test has higher power than the two modified LRTs of Zhang et al. (2019) in all cases that both we and they consider.

[10]  arXiv:1911.06030 [pdf]
Title: Guidelines for estimating causal effects in pragmatic randomized trials
Subjects: Methodology (stat.ME)

Pragmatic randomized trials are designed to provide evidence for clinical decision-making rather than regulatory approval. Common features of these trials include the inclusion of heterogeneous or diverse patient populations in a wide range of care settings, the use of active treatment strategies as comparators, unblinded treatment assignment, and the study of long-term, clinically relevant outcomes. These features can greatly increase the usefulness of the trial results for patients, clinicians, and other stakeholders. However, these features also introduce an increased risk of non-adherence, which reduces the value of the intention-to-treat effect as a patient-centered measure of causal effect. In these settings, the per-protocol effect provides useful complementary information for decision making. Unfortunately, there is little guidance for valid estimation of the per-protocol effect. Here, we present our full guidelines for analyses of pragmatic trials that will result in more informative causal inferences for both the intention-to-treat effect and the per-protocol effect.

[11]  arXiv:1911.06177 [pdf, other]
Title: Uncertainty Quantification in Ensembles of Honest Regression Trees using Generalized Fiducial Inference
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)

Because of their accuracy, methods based on ensembles of regression trees are a popular approach for making predictions. Common examples include Bayesian additive regression trees, boosting, and random forests. This paper focuses on honest random forests, which add honesty to the original form of random forests and have been proved to have better statistical properties. The main contribution is a new method that quantifies the uncertainties of the estimates and predictions produced by honest random forests. The proposed method is based on the generalized fiducial methodology, and provides a fiducial density function that measures how likely each single honest tree is to be the true model. With such a density function, estimates and predictions, as well as their confidence/prediction intervals, can be obtained. The promising empirical properties of the proposed method are demonstrated by numerical comparisons with several state-of-the-art methods, and by applications to a few real data sets. Lastly, the proposed method is theoretically backed up by a strong asymptotic guarantee.

[12]  arXiv:1911.06213 [pdf, other]
Title: Analysis of the fiber laydown quality in spunbond processes with simulation experiments evaluated by blocked neural networks
Comments: 12 pages, 23 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We present a simulation framework for spunbond processes and use a design of experiments to investigate the cause-and-effect relations between process and material parameters and the fiber laydown on a conveyor belt. The virtual experiments are analyzed by a blocked neural network. This forms the basis for the prediction of the fiber laydown characteristics and enables a quick ranking of the significance of the influencing effects. We conclude with an analysis of the nonlinear cause-and-effect relations.

[13]  arXiv:1911.06215 [pdf, other]
Title: Sparse Density Estimation with Measurement Errors
Comments: 32 pages, 4 figures
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG)

This paper aims to estimate an unknown density from data with measurement error as a linear combination of functions from a dictionary. Inspired by penalization approaches, we propose a weighted Elastic-net penalized minimal $L_2$-distance method for sparse coefficient estimation, where the weights are chosen adaptively from sharp concentration inequalities. The optimal weighted tuning parameters are obtained from first-order conditions that hold with high probability. Under local coherence or minimal eigenvalue assumptions, non-asymptotic oracle inequalities are derived. These theoretical results are then used to obtain support recovery with high probability. The calibration of these procedures is studied in numerical experiments for discrete and continuous distributions, which show a significant improvement of our procedure over other conventional approaches. Finally, the method is applied to a meteorology data set, where it detects the shape of a multi-mode density better than other conventional approaches.

[14]  arXiv:1911.06225 [pdf, other]
Title: Location estimation for symmetric log-concave densities
Authors: Nilanjana Laha
Subjects: Statistics Theory (math.ST)

We revisit the problem of estimating the center of symmetry $\theta$ of an unknown symmetric density $f$. Although Stone (1975), Van Eden (1970), and Sacks (1975) constructed adaptive estimators of $\theta$ in this model, their estimators depend on tuning parameters. In an effort to circumvent the dependence on tuning parameters, we impose an additional assumption of log-concavity on $f$. We show that in this shape-restricted model, the maximum likelihood estimator (MLE) of $\theta$ exists. We also study some truncated one-step estimators and show that they are $\sqrt{n}$-consistent, and nearly achieve the asymptotic efficiency bound. We also show that the rate of convergence for the MLE is $O_p(n^{-2/5})$. Furthermore, we show that our estimators are robust with respect to the violation of the log-concavity assumption. In fact, we show that the one-step estimators are still $\sqrt{n}$-consistent under some mild conditions. These analytical conclusions are supported by simulation studies.

[15]  arXiv:1911.06239 [pdf, other]
Title: Unreliable Multi-Armed Bandits: A Novel Approach to Recommendation Systems
Comments: 4 pages, 4 figures, Aditya Narayan Ravi and Pranav Poduval have equal contribution
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)

We use a novel modification of Multi-Armed Bandits to create a new model for recommendation systems. We model the recommendation system as a bandit seeking to maximize reward by pulling on arms with unknown rewards. The catch, however, is that this bandit can only access these arms through an unreliable intermediate that has some level of autonomy while choosing its arms. For example, on a streaming website the user has a lot of autonomy while choosing the content they want to watch. The streaming site can use targeted advertising as a means to bias the opinions of these users. Here the streaming site is the bandit aiming to maximize reward and the user is the unreliable intermediate. We model the intermediate as accessing states via a Markov chain. The bandit is allowed to perturb this Markov chain. We prove fundamental theorems for this setting, after which we show a close-to-optimal Explore-Commit algorithm.
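For readers unfamiliar with the Explore-Commit idea named in this abstract, the sketch below shows the classical explore-then-commit strategy for ordinary Bernoulli bandits: pull every arm a fixed number of times, then commit to the empirically best arm. It does not model the unreliable intermediate or the Markov chain from the paper, and the function name, arm means, and budget values are illustrative assumptions.

```python
import random


def explore_then_commit(arm_means, pulls_per_arm, horizon, seed=0):
    """Classical explore-then-commit on Bernoulli arms (not the paper's variant)."""
    rng = random.Random(seed)

    def pull(arm):
        # Bernoulli reward with the arm's (unknown to the learner) mean.
        return 1.0 if rng.random() < arm_means[arm] else 0.0

    n_arms = len(arm_means)
    totals = [0.0] * n_arms
    reward = 0.0

    # Exploration phase: sample each arm the same number of times.
    for arm in range(n_arms):
        for _ in range(pulls_per_arm):
            r = pull(arm)
            totals[arm] += r
            reward += r

    # Commit phase: play the empirically best arm for the remaining rounds.
    best = max(range(n_arms), key=lambda arm: totals[arm])
    for _ in range(horizon - pulls_per_arm * n_arms):
        reward += pull(best)
    return best, reward


best_arm, total_reward = explore_then_commit(
    [0.1, 0.5, 0.9], pulls_per_arm=200, horizon=5000
)
print("committed arm:", best_arm, "total reward:", total_reward)
```

The exploration budget trades off the cost of sampling poor arms against the risk of committing to the wrong one; how that trade-off changes when an intermediary controls the pulls is the question the paper addresses.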

[16]  arXiv:1911.06253 [pdf, ps, other]
Title: Understanding Graph Neural Networks with Asymmetric Geometric Scattering Transforms
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

The scattering transform is a multilayered wavelet-based deep learning architecture that acts as a model of convolutional neural networks. Recently, several works have introduced generalizations of the scattering transform for non-Euclidean settings such as graphs. Our work builds upon these constructions by introducing windowed and non-windowed graph scattering transforms based upon a very general class of asymmetric wavelets. We show that these asymmetric graph scattering transforms have many of the same theoretical guarantees as their symmetric counterparts. This work helps bridge the gap between scattering and other graph neural networks by introducing a large family of networks with provable stability and invariance guarantees. This lays the groundwork for future deep learning architectures for graph-structured data that have learned filters and also provably have desirable theoretical properties.

[17]  arXiv:1911.06287 [pdf, other]
Title: Scalable Exact Inference in Multi-Output Gaussian Processes
Comments: 19 pages, 9 figures, includes appendix
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Multi-output Gaussian processes (MOGPs) leverage the flexibility and interpretability of GPs while capturing structure across outputs, which is desirable, for example, in spatio-temporal modelling. The key problem with MOGPs is the cubic computational scaling in the number of both inputs (e.g., time points or locations), n, and outputs, p. Current methods reduce this to O(n^3 m^3), where m < p is the desired degrees of freedom. This computational cost, however, is still prohibitive in many applications. To address this limitation, we present the Orthogonal Linear Mixing Model (OLMM), an MOGP in which exact inference scales linearly in m: O(n^3 m). This advance opens up a wide range of real-world tasks and can be combined with existing GP approximations in a plug-and-play way as demonstrated in the paper. Additionally, the paper organises the existing disparate literature on MOGP models into a simple taxonomy called the Mixing Model Hierarchy (MMH).

[18]  arXiv:1911.06302 [pdf, other]
Title: rFIA: An R package for space-time estimation of forest attributes with the Forest Inventory and Analysis Database
Subjects: Applications (stat.AP)

rFIA is an R package designed to simplify the estimation of forest attributes using the USDA Forest Service Forest Inventory and Analysis (FIA) Database. Specifically, rFIA improves accessibility to the spatio-temporal estimation capacity of the FIA Database via space-time indexed summaries of forest variables within user-defined population boundaries. Direct integration with other popular R packages (e.g., dplyr, sf, and parallel) facilitates efficient space-time query and data summary, and supports common data representations and application programming interface (API). The package implements design-based estimation procedures used by the FIA Program, and has been validated against official estimates and sampling errors produced by the FIA Program. We demonstrate the utility of rFIA by assessing changes in abundance and mortality rates of ash populations in the lower peninsula of Michigan following the establishment of emerald ash borer.

Cross-lists for Fri, 15 Nov 19

[19]  arXiv:1911.05774 (cross-list from cs.LG) [pdf, ps, other]
Title: Factor Group-Sparse Regularization for Efficient Low-Rank Matrix Recovery
Comments: Accepted by NeurIPS 2019
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

This paper develops a new class of nonconvex regularizers for low-rank matrix recovery. Many regularizers are motivated as convex relaxations of the matrix rank function. Our new factor group-sparse regularizers are motivated as a relaxation of the number of nonzero columns in a factorization of the matrix. These nonconvex regularizers are sharper than the nuclear norm; indeed, we show they are related to Schatten-$p$ norms with arbitrarily small $0 < p \leq 1$. Moreover, these factor group-sparse regularizers can be written in a factored form that enables efficient and effective nonconvex optimization; notably, the method does not use singular value decomposition. We provide generalization error bounds for low-rank matrix completion which show improved upper bounds for Schatten-$p$ norm regularization as $p$ decreases. Compared to the max norm and the factored formulation of the nuclear norm, factor group-sparse regularizers are more efficient, accurate, and robust to the initial guess of rank. Experiments show promising performance of factor group-sparse regularization for low-rank matrix completion and robust principal component analysis.

[20]  arXiv:1911.05781 (cross-list from cs.LG) [pdf, ps, other]
Title: Learning internal representations
Authors: Jonathan Baxter
Journal-ref: COLT '95 Proceedings of the eighth annual conference on Computational learning theory (1995) 311-320
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Probably the most important problem in machine learning is the preliminary biasing of a learner's hypothesis space so that it is small enough to ensure good generalisation from reasonable training sets, yet large enough that it contains a good solution to the problem being learnt. In this paper a mechanism for automatically learning or biasing the learner's hypothesis space is introduced. It works by first learning an appropriate internal representation for a learning environment and then using that representation to bias the learner's hypothesis space for the learning of future tasks drawn from the same environment.
An internal representation must be learnt by sampling from many similar tasks, not just a single task as occurs in ordinary machine learning. It is proved that the number of examples $m$ per task required to ensure good generalisation from a representation learner obeys $m = O(a+b/n)$ where $n$ is the number of tasks being learnt and $a$ and $b$ are constants. If the tasks are learnt independently (i.e. without a common representation) then $m=O(a+b)$. It is argued that for learning environments such as speech and character recognition $b\gg a$ and hence representation learning in these environments can potentially yield a drastic reduction in the number of examples required per task. It is also proved that if $n = O(b)$ (with $m=O(a+b/n)$) then the representation learnt will be good for learning novel tasks from the same environment, and that the number of examples required to generalise well on a novel task will be reduced to $O(a)$ (as opposed to $O(a+b)$ if no representation is used).
It is shown that gradient descent can be used to train neural network representations, and experimental results are reported providing strong qualitative support for the theoretical results.
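To make the size of the claimed saving concrete, the bounds in this abstract can be evaluated for some illustrative constants. The values $a = 10$, $b = 1000$, and $n = 100$ below are assumptions chosen only to satisfy $b \gg a$, and the $O(\cdot)$ notation hides constant factors, so the numbers indicate orders of magnitude rather than actual sample sizes.

```latex
% Illustrative constants only: a = 10, b = 1000, n = 100 tasks.
\[
  m_{\text{shared}} \sim a + \frac{b}{n} = 10 + \frac{1000}{100} = 20,
  \qquad
  m_{\text{independent}} \sim a + b = 10 + 1000 = 1010 .
\]
```

Under these assumed constants the per-task requirement drops by a factor of about 50, and as $n$ grows it approaches the floor of $a$ examples per task.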

[21]  arXiv:1911.05806 (cross-list from cs.LG) [pdf, other]
Title: Coarse-Refinement Dilemma: On Generalization Bounds for Data Clustering
Comments: 52 pages (in which 5 pages contain references, 1 contains notation, 1 contains dictionary of terms, 2 contain proofs, 5 contain dataset images and 7 contain results)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

The Data Clustering (DC) problem is of central importance for the area of Machine Learning (ML), given its usefulness in representing structural similarities of data from input spaces. Unlike Supervised Machine Learning (SML), which relies on the theoretical frameworks of Statistical Learning Theory (SLT) and Algorithm Stability (AS), DC has scarce literature on general-purpose learning guarantees, which affects conclusive remarks on how those algorithms should be designed as well as on the validity of their results. In this context, this manuscript introduces a new concept, based on multidimensional persistent homology, to analyze the conditions under which a clustering model is capable of generalizing data. As a first step, we propose a more general definition of the DC problem by relying on Topological Spaces, instead of metric ones as typically approached in the literature. From that, we show that the DC problem presents a dilemma analogous to the Bias-Variance one, which is here referred to as the Coarse-Refinement (CR) dilemma. CR is intended to clarify the contrast between: (i) highly-refined partitions and the clustering instability (overfitting); and (ii) over-coarse partitions and the lack of representativeness (underfitting); consequently, the CR dilemma suggests the need for a relaxation of Kleinberg's richness axiom. Experimental results illustrate that multidimensional persistent homology supports the measurement of divergences among DC models, leading to a consistency criterion.

[22]  arXiv:1911.05811 (cross-list from cs.LG) [pdf, other]
Title: Triply Robust Off-Policy Evaluation
Comments: Preliminary Work
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We propose a robust regression approach to off-policy evaluation (OPE) for contextual bandits. We frame OPE as a covariate-shift problem and leverage modern robust regression tools. Ours is a general approach that can be used to augment any existing OPE method that utilizes the direct method. When augmenting doubly robust methods, we call the resulting method Triply Robust. We prove upper bounds on the resulting bias and variance, as well as derive novel minimax bounds based on robust minimax analysis for covariate shift. Our robust regression method is compatible with deep learning, and is thus applicable to complex OPE settings that require powerful function approximators. Finally, we demonstrate superior empirical performance across the standard OPE benchmarks, especially in the case where the logging policy is unknown and must be estimated from data.

[23]  arXiv:1911.05815 (cross-list from cs.LG) [pdf, other]
Title: Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We present an algorithm, HOMER, for exploration and reinforcement learning in rich observation environments that are summarizable by an unknown latent state space. The algorithm interleaves representation learning to identify a new notion of kinematic state abstraction with strategic exploration to reach new states using the learned abstraction. The algorithm provably explores the environment with sample complexity scaling polynomially in the number of latent states and the time horizon, and, crucially, with no dependence on the size of the observation space, which could be infinitely large. This exploration guarantee further enables sample-efficient global policy optimization for any reward function. On the computational side, we show that the algorithm can be implemented efficiently whenever certain supervised learning problems are tractable. Empirically, we evaluate HOMER on a challenging exploration problem, where we show that the algorithm is exponentially more sample efficient than standard reinforcement learning baselines.

[24]  arXiv:1911.05843 (cross-list from cs.LG) [pdf, other]
Title: TASTE: Temporal and Static Tensor Factorization for Phenotyping Electronic Health Records
Comments: 19 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Phenotyping electronic health records (EHR) focuses on defining meaningful patient groups (e.g., heart failure group and diabetes group) and identifying the temporal evolution of patients in those groups. Tensor factorization has been an effective tool for phenotyping. Most of the existing works assume either a static patient representation with aggregate data or only model temporal data. However, real EHR data contain both temporal (e.g., longitudinal clinical visits) and static information (e.g., patient demographics), which are difficult to model simultaneously. In this paper, we propose Temporal And Static TEnsor factorization (TASTE) that jointly models both static and temporal information to extract phenotypes. TASTE combines the PARAFAC2 model with non-negative matrix factorization to model a temporal and a static tensor. To fit the proposed model, we transform the original problem into simpler ones which are optimally solved in an alternating fashion. For each of the sub-problems, our proposed mathematical reformulations lead to efficient sub-problem solvers. Comprehensive experiments on large EHR data from a heart failure (HF) study confirmed that TASTE is up to 14x faster than several baselines and the resulting phenotypes were confirmed to be clinically meaningful by a cardiologist. Using 80 phenotypes extracted by TASTE, a simple logistic regression can achieve the same level of area under the curve (AUC) for HF prediction compared to a deep learning model using recurrent neural networks (RNN) with 345 features.

[25]  arXiv:1911.05861 (cross-list from cs.LG) [pdf, other]
Title: Federated and Differentially Private Learning for Electronic Health Records
Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

The use of collaborative and decentralized machine learning techniques such as federated learning has the potential to enable the development and deployment of clinical risk prediction models in low-resource settings without requiring sensitive data be shared or stored in a central repository. This process necessitates communication of model weights or updates between collaborating entities, but it is unclear to what extent patient privacy is compromised as a result. To gain insight into this question, we study the efficacy of centralized versus federated learning in both private and non-private settings. The clinical prediction tasks we consider are the prediction of prolonged length of stay and in-hospital mortality across thirty-one hospitals in the eICU Collaborative Research Database. We find that while it is straightforward to apply differentially private stochastic gradient descent to achieve strong privacy bounds when training in a centralized setting, it is considerably more difficult to do so in the federated setting.

[26]  arXiv:1911.05873 (cross-list from cs.LG) [pdf, ps, other]
Title: A Reduction from Reinforcement Learning to No-Regret Online Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which "any" online algorithm with sublinear regret can generate policies with provable performance guarantees. This new perspective decouples the RL problem into two parts: regret minimization and function approximation. The first part admits a standard online-learning analysis, and the second part can be quantified independently of the learning algorithm. Therefore, the proposed reduction can be used as a tool to systematically design new RL algorithms. We demonstrate this idea by devising a simple RL algorithm based on mirror descent and the generative-model oracle. For any $\gamma$-discounted tabular RL problem, with probability at least $1-\delta$, it learns an $\epsilon$-optimal policy using at most $\tilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\log(\frac{1}{\delta})}{(1-\gamma)^4\epsilon^2}\right)$ samples. Furthermore, this algorithm admits a direct extension to linearly parameterized function approximators for large-scale applications, with computation and sample complexities independent of $|\mathcal{S}|$ and $|\mathcal{A}|$, though at the cost of potential approximation bias.

[27]  arXiv:1911.05887 (cross-list from cs.LG) [pdf, other]
Title: Revenue Maximization of Airbnb Marketplace using Search Results
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Correctly pricing products or services in an online marketplace is a challenging problem and one of the critical factors for the success of the business. When users are looking to buy an item, they typically search for it. Query relevance models are used at this stage to retrieve and rank the items on the search page from most relevant to least relevant. The presented items are naturally "competing" against each other for user purchases. We provide a practical two-stage model to price this set of retrieved items, for which distributions of their values are learned. The initial output of the pricing strategy is a price vector for the top displayed items in one search event. We later aggregate these results over searches to provide the supplier with the optimal price for each item. We applied our solution to large-scale search data obtained from the Airbnb Experiences marketplace. Offline evaluation results show that our strategy improves upon baseline pricing strategies on key metrics by at least +20% in terms of booking regret and +55% in terms of revenue potential.

[28]  arXiv:1911.05894 (cross-list from cs.SD) [pdf, other]
Title: Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
Comments: This extended version of an ICASSP 2020 submission under the same title has an added figure and additional discussion for easier consumption
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)

Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate categories into relevant semantic classes. By training a combined sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the number of labels required to reach a desired classification performance.

[29]  arXiv:1911.05904 (cross-list from cs.LG) [pdf, other]
Title: There is Limited Correlation between Coverage and Robustness for Deep Neural Networks
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE); Machine Learning (stat.ML)

Deep neural networks (DNNs) are increasingly applied in safety-critical systems, e.g., for face recognition, autonomous car control and malware detection. It has also been shown that DNNs are subject to attacks such as adversarial perturbation and thus must be properly tested. Many coverage criteria for DNNs have since been proposed, inspired by the success of code coverage criteria for software programs. The expectation is that if a DNN is well tested (and retrained) according to such coverage criteria, it is more likely to be robust. In this work, we conduct an empirical study to evaluate the relationship between coverage, robustness and attack/defense metrics for DNNs. Our study is the largest to date and is systematically done based on 100 DNN models and 25 metrics. One of our findings is that there is limited correlation between coverage and robustness, i.e., improving coverage does not help improve the robustness. Our dataset and implementation have been made available to serve as a benchmark for future studies on testing DNNs.

[30]  arXiv:1911.05909 (cross-list from cs.LG) [pdf, other]
Title: Explainable Ordinal Factorization Model: Deciphering the Effects of Attributes by Piece-wise Linear Approximation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Ordinal regression predicts the objects' labels that exhibit a natural ordering, which is important to many managerial problems such as credit scoring and clinical diagnosis. In these problems, the ability to explain how the attributes affect the prediction is critical to users. However, most, if not all, existing ordinal regression models simplify such explanation to the form of constant coefficients for the main and interaction effects of individual attributes. Such an explanation cannot characterize the contributions of attributes at different value scales. To address this challenge, we propose a new explainable ordinal regression model, namely, the Explainable Ordinal Factorization Model (XOFM). XOFM uses piece-wise linear functions to approximate the actual contributions of individual attributes and their interactions. Moreover, XOFM introduces a novel ordinal transformation process to assign each object the probabilities of belonging to multiple relevant classes, instead of fixing boundaries to differentiate classes. XOFM is based on Factorization Machines to handle the potential sparsity problem that results from discretizing the attribute scales. Comprehensive experiments with benchmark datasets and baseline models demonstrate that the proposed XOFM exhibits superior explainability and leads to state-of-the-art prediction accuracy.

[31]  arXiv:1911.05911 (cross-list from cs.DS) [pdf, ps, other]
Title: Recent Advances in Algorithmic High-Dimensional Robust Statistics
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Statistics Theory (math.ST); Machine Learning (stat.ML)

Learning in the presence of outliers is a fundamental problem in statistics. Until recently, all known efficient unsupervised learning algorithms were very sensitive to outliers in high dimensions. In particular, even for the task of robust mean estimation under natural distributional assumptions, no efficient algorithm was known. Recent work in theoretical computer science gave the first efficient robust estimators for a number of fundamental statistical tasks, including mean and covariance estimation. Since then, there has been a flurry of research activity on algorithmic high-dimensional robust estimation in a range of settings. In this survey article, we introduce the core ideas and algorithmic techniques in the emerging area of algorithmic high-dimensional robust statistics with a focus on robust mean estimation. We also provide an overview of the approaches that have led to computationally efficient robust estimators for a range of broader statistical tasks and discuss new directions and opportunities for future work.

[32]  arXiv:1911.05916 (cross-list from cs.LG) [pdf, other]
Title: Adversarial Margin Maximization Networks
Comments: 11 pages + 1 page appendix, accepted by T-PAMI
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

The tremendous recent success of deep neural networks (DNNs) has sparked a surge of interest in understanding their predictive ability. Unlike the human visual system, which is able to generalize robustly and learn with little supervision, DNNs normally require a massive amount of data to learn new concepts. In addition, research works also show that DNNs are vulnerable to adversarial examples, i.e., maliciously generated images which seem perceptually similar to the natural ones but are actually formed to fool learning models, which means the models have problems generalizing to unseen data with certain types of distortions. In this paper, we analyze the generalization ability of DNNs comprehensively and attempt to improve it from a geometric point of view. We propose adversarial margin maximization (AMM), a learning-based regularization which exploits an adversarial perturbation as a proxy. It encourages a large margin in the input space, just like support vector machines. With a differentiable formulation of the perturbation, we train the regularized DNNs simply through back-propagation in an end-to-end manner. Experimental results on various datasets (including MNIST, CIFAR-10/100, SVHN and ImageNet) and different DNN architectures demonstrate the superiority of our method over previous state-of-the-art methods. Code and models for reproducing our results will be made publicly available.

[33]  arXiv:1911.05922 (cross-list from cs.LG) [pdf, other]
Title: Atari-fying the Vehicle Routing Problem with Stochastic Service Requests
Comments: 11 pages, 4 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

We present a new general approach to modeling research problems as Atari-like videogames to make them amenable to recent groundbreaking solution methods from the deep reinforcement learning community. The approach is flexible and applicable to a wide range of problems. We demonstrate its application on a well-known vehicle routing problem. Our preliminary results on this problem, though not transformative, show signs of success and suggest that Atari-fication may be a useful modeling approach for researchers studying problems involving sequential decision making under uncertainty.

[34]  arXiv:1911.05941 (cross-list from cs.LG) [pdf, other]
Title: An Efficient Hardware-Oriented Dropout Algorithm
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

This paper proposes a hardware-oriented dropout algorithm, which is efficient for field programmable gate array (FPGA) implementation. In deep neural networks (DNNs), overfitting occurs when networks are overtrained and adapt too well to training data. Consequently, they fail in predicting unseen data used as test data. Dropout is a common technique that is often applied in DNNs to overcome this problem. In general, implementing such training algorithms of DNNs in embedded systems is difficult due to power and memory constraints. Training DNNs is power-, time-, and memory-intensive; however, embedded systems require low power consumption and real-time processing. An FPGA is suitable for embedded systems because of its parallel processing characteristic and low operating power; however, due to its limited memory and different architecture, it is difficult to apply general neural network algorithms. Therefore, we propose a hardware-oriented dropout algorithm that can effectively utilize the characteristics of an FPGA with less memory required. Software program verification demonstrates that the performance of the proposed method is identical to that of conventional dropout, and hardware synthesis demonstrates that it results in significant resource reduction.

[35]  arXiv:1911.05942 (cross-list from cs.CV) [pdf, other]
Title: Progressive Feature Polishing Network for Salient Object Detection
Comments: Accepted by AAAI 2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

Features matter for salient object detection. Existing methods mainly focus on designing a sophisticated structure to incorporate multi-level features and filter out cluttered features. We present the Progressive Feature Polishing Network (PFPN), a simple yet effective framework to progressively polish the multi-level features to be more accurate and representative. By employing multiple Feature Polishing Modules (FPMs) in a recurrent manner, our approach is able to detect salient objects with fine details without any post-processing. An FPM updates the features of each level in parallel by directly incorporating all higher level context information. Moreover, it can keep the dimensions and hierarchical structures of the feature maps, which makes it flexible to integrate with any CNN-based model. Empirical experiments show that our results improve monotonically as the number of FPMs increases. Without bells and whistles, PFPN outperforms the state-of-the-art methods significantly on five benchmark datasets under various evaluation metrics.

[36]  arXiv:1911.05944 (cross-list from cs.LG) [pdf, other]
Title: 2L-3W: 2-Level 3-Way Hardware-Software Co-Verification for the Mapping of Deep Learning Architecture (DLA) onto FPGA Boards
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

FPGAs have become a popular choice for deploying deep learning architectures (DLA). Many researchers have explored the deployment and mapping of DLA on FPGA. However, there has been a growing need to do design-time hardware-software co-verification of these deployments. To the best of our knowledge, this is the first work that proposes a 2-Level 3-Way (2L-3W) hardware-software co-verification methodology and provides a step-by-step guide for the successful mapping, deployment and verification of DLA on FPGA boards. The 2-Level verification ensures that the implementation in each stage (software and hardware) follows the desired behavior. The 3-Way co-verification provides a cross-paradigm (software, design and hardware) layer-by-layer parameter check to assure the correct implementation and mapping of the DLA onto FPGA boards. The proposed 2L-3W co-verification methodology has been evaluated over several test cases. In each case, the prediction and layer-by-layer output of the DLA deployed on a PYNQ FPGA board (hardware), the intermediate design results of the layer-by-layer output of the DLA implemented in Vivado HLS, and the prediction and layer-by-layer output at the software level (Caffe deep learning framework) are compared to obtain a layer-by-layer similarity score. The comparison is performed by a completely automated Python script. The similarity score informs us of the degree of success of the DLA mapping to the FPGA or helps identify at design time the layer to be debugged in the case of unsuccessful mapping. We demonstrated our technique on the LeNet DLA and the Caffe-inspired Cifar-10 DLA, and the co-verification results yielded layer-by-layer similarity scores of 99% accuracy.

[37]  arXiv:1911.05949 (cross-list from cs.LG) [pdf, ps, other]
Title: Online Second Price Auction with Semi-bandit Feedback Under the Non-Stationary Setting
Authors: Haoyu Zhao, Wei Chen
Comments: Accepted to AAAI-20
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)

In this paper, we study the non-stationary online second price auction problem. We assume that the seller is selling the same type of items in $T$ rounds by the second price auction, and she can set the reserve price in each round. In each round, the bidders draw their private values from a joint distribution unknown to the seller. Then, the seller announces the reserve price for this round. Next, bidders with private values higher than the announced reserve price in that round report their values to the seller as their bids. The bidder with the highest bid above the reserve price wins the item and pays the seller a price equal to the second-highest bid or the reserve price, whichever is larger. The seller wants to maximize her total revenue during the time horizon $T$ while learning the distribution of private values over time. The problem is more challenging than the standard online learning scenario since the private value distribution is non-stationary, meaning that the distribution of bidders' private values may change over time, and we need to use the non-stationary regret to measure the performance of our algorithm. To our knowledge, this paper is the first to study the repeated auction in the non-stationary setting theoretically. Our algorithm achieves the non-stationary regret upper bound $\tilde{\mathcal{O}}(\min\{\sqrt{\mathcal S T}, \bar{\mathcal{V}}^{\frac{1}{3}}T^{\frac{2}{3}}\})$, where $\mathcal S$ is the number of switches in the distribution and $\bar{\mathcal{V}}$ is the sum of total variation; neither $\mathcal S$ nor $\bar{\mathcal{V}}$ needs to be known by the algorithm. We also prove regret lower bounds $\Omega(\sqrt{\mathcal S T})$ in the switching case and $\Omega(\bar{\mathcal{V}}^{\frac{1}{3}}T^{\frac{2}{3}})$ in the dynamic case, showing that our algorithm has nearly optimal non-stationary regret.
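The single-round mechanism described in this abstract can be sketched as follows. This is only an illustration of the auction rule as stated in the text, not the paper's learning algorithm; the function name `second_price_with_reserve` and the tie-breaking behavior are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def second_price_with_reserve(
    values: List[float], reserve: float
) -> Tuple[Optional[int], float]:
    """Run one auction round; return (winner index or None, seller revenue).

    Hypothetical helper for illustration only.
    """
    # Only bidders whose private value exceeds the reserve price submit a bid.
    bids = [(v, i) for i, v in enumerate(values) if v > reserve]
    if not bids:
        return None, 0.0  # nobody bids, so there is no sale this round
    # Highest bid first; ties are broken arbitrarily (here: by larger index).
    bids.sort(reverse=True)
    winner = bids[0][1]
    # The winner pays the larger of the second-highest bid and the reserve.
    second = bids[1][0] if len(bids) > 1 else reserve
    return winner, max(second, reserve)
```

With private values `[0.9, 0.5, 0.3]`, a reserve of 0.4 lets two bidders in and the winner pays the second-highest bid, while a reserve of 0.6 leaves a single bidder who pays the reserve itself.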

[38]  arXiv:1911.05952 (cross-list from q-fin.ST) [pdf, other]
Title: Change-point Analysis in Financial Networks
Subjects: Statistical Finance (q-fin.ST); Applications (stat.AP)

A major impact of globalization has been the information flow across the financial markets, rendering them vulnerable to financial contagion. Research has focused on network analysis techniques to understand the extent and nature of such information flow. It is now an established fact that a stock market crash in one country can have a serious impact on other markets across the globe. It follows that such crashes or critical regimes will affect the network dynamics of the global financial markets. In this paper, we use sequential change point detection in dynamic networks to detect changes in the network characteristics of thirteen stock markets across the globe. Our method helps us to detect changes in network behavior across all known stock market crashes during the period of study. In most of the cases, we can detect a change in the network characteristics prior to the crash. Our work thus opens the possibility of using this technique to create a warning bell for critical regimes in financial markets.

[39]  arXiv:1911.05954 (cross-list from cs.LG) [pdf, other]
Title: Hierarchical Graph Pooling with Structure Learning
Comments: Accepted to AAAI-2020; Code is available at this https URL
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Graph Neural Networks (GNNs), which generalize deep neural networks to graph-structured data, have drawn considerable attention and achieved state-of-the-art performance in numerous graph related tasks. However, existing GNN models mainly focus on designing graph convolution operations. The graph pooling (or downsampling) operations, which play an important role in learning hierarchical representations, are usually overlooked. In this paper, we propose a novel graph pooling operator, called Hierarchical Graph Pooling with Structure Learning (HGP-SL), which can be integrated into various graph neural network architectures. HGP-SL incorporates graph pooling and structure learning into a unified module to generate hierarchical representations of graphs. More specifically, the graph pooling operation adaptively selects a subset of nodes to form an induced subgraph for the subsequent layers. To preserve the integrity of the graph's topological information, we further introduce a structure learning mechanism to learn a refined graph structure for the pooled graph at each layer. By combining the HGP-SL operator with graph neural networks, we perform graph level representation learning with a focus on the graph classification task. Experimental results on six widely used benchmarks demonstrate the effectiveness of our proposed model.

[40]  arXiv:1911.05956 (cross-list from cs.LG) [pdf, other]
Title: Contextual Bandits Evolving Over Finite Time
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Contextual bandits have the same exploration-exploitation trade-off as standard multi-armed bandits. On adding positive externalities that decay with time, this problem becomes much more difficult as wrong decisions at the start are hard to recover from. We explore existing policies in this setting and highlight their biases towards the inherent reward matrix. We propose a rejection based policy that achieves a low regret irrespective of the structure of the reward probability matrix.

[41]  arXiv:1911.05990 (cross-list from cs.LG) [pdf, other]
Title: Attention on Abstract Visual Reasoning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Attention mechanisms have been boosting the performance of deep learning models on a wide range of applications, ranging from speech understanding to program induction. However, despite experiments from psychology which suggest that attention plays an essential role in visual reasoning, the full potential of attention mechanisms has so far not been explored to solve abstract cognitive tasks on image data. In this work, we propose a hybrid network architecture, grounded on self-attention and relational reasoning. We call this new model Attention Relation Network (ARNe). ARNe combines features from the recently introduced Transformer and the Wild Relation Network (WReN). We test ARNe on the Procedurally Generated Matrices (PGMs) datasets for abstract visual reasoning. ARNe outperforms the WReN model on this task by 11.28 ppt. Relational concepts between objects are efficiently learned, requiring only 35% of the training samples to surpass the reported accuracy of the baseline model. Our proposed hybrid model represents an alternative for learning abstract relations using self-attention and demonstrates that the Transformer network is also well suited for abstract visual reasoning.

[42]  arXiv:1911.05996 (cross-list from cs.LG) [pdf, other]
Title: Privacy and Utility Preserving Sensor-Data Transformations
Comments: Accepted to appear in Pervasive and Mobile computing (PMC) Journal, Elsevier
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP); Machine Learning (stat.ML)

Sensitive inferences and user re-identification are major threats to privacy when raw sensor data from wearable or portable devices are shared with cloud-assisted applications. To mitigate these threats, we propose mechanisms to transform sensor data before sharing them with applications running on users' devices. These transformations aim at eliminating patterns that can be used for user re-identification or for inferring potentially sensitive activities, while introducing a minor utility loss for the target application (or task). We show that, on gesture and activity recognition tasks, we can prevent inference of potentially sensitive activities while keeping the reduction in recognition accuracy of non-sensitive activities to less than 5 percentage points. We also show that we can reduce the accuracy of user re-identification and of the potential inference of gender to the level of a random guess, while keeping the accuracy of activity recognition comparable to that obtained on the original data.

[43]  arXiv:1911.05999 (cross-list from cs.LG) [pdf, other]
Title: An Application of Multiple-Instance Learning to Estimate Generalization Risk
Authors: Daiki Suehiro
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We focus on several learning approaches that employ the max-operator to evaluate the margin. For example, such approaches are commonly used in multi-class learning tasks and top-rank learning tasks. In general, in order to estimate the theoretical generalization risk, we need to individually evaluate the complexity of each hypothesis class used in the learning approaches. In this paper, we provide a technique to estimate a theoretical generalization risk for such learning approaches in the same fashion. The key idea is to "redundantly" reformulate the learning problem as one-class multiple-instance learning by redefining the specific input space based on the original input space. Surprisingly, we succeed in improving the generalization risk bounds for some multi-class learning and top-rank learning algorithms.

[44]  arXiv:1911.06009 (cross-list from cs.LG) [pdf, other]
Title: A Recurrent Probabilistic Neural Network with Dimensionality Reduction Based on Time-series Discriminant Component Analysis
Comments: Published in IEEE Transactions on Neural Networks and Learning Systems
Journal-ref: IEEE Transactions on Neural Networks and Learning Systems, Vol. 26, No. 12, pp. 3021-3033, 2015
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

This paper proposes a probabilistic neural network developed on the basis of time-series discriminant component analysis (TSDCA) that can be used to classify high-dimensional time-series patterns. TSDCA involves the compression of high-dimensional time series into a lower-dimensional space using a set of orthogonal transformations and the calculation of posterior probabilities based on a continuous-density hidden Markov model with a Gaussian mixture model expressed in the reduced-dimensional space. The analysis can be incorporated into a neural network, which is named a time-series discriminant component network (TSDCN), so that parameters of dimensionality reduction and classification can be obtained simultaneously as network coefficients according to a backpropagation through time-based learning algorithm with the Lagrange multiplier method. The TSDCN is considered to enable high-accuracy classification of high-dimensional time-series patterns and to reduce the computation time taken for network training. The validity of the TSDCN is demonstrated for high-dimensional artificial data and EEG signals in the experiments conducted during the study.

[45]  arXiv:1911.06015 (cross-list from cs.LG) [pdf, other]
Title: Robust Parameter-Free Season Length Detection in Time Series
Comments: MileTS 2017
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

The in-depth analysis of time series has gained a lot of research interest in recent years, with the identification of periodic patterns being one important aspect. Many of the methods for identifying periodic patterns require the time series' season length as an input parameter. There exist only a few algorithms for automatic season length approximation. Many of these rely on simplifications such as data discretization and user-defined parameters. This paper presents an algorithm for season length detection that is designed to be sufficiently reliable to be used in practical applications and does not require any input other than the time series to be analyzed. The algorithm estimates a time series' season length by interpolating, filtering and detrending the data. This is followed by analyzing the distances between zeros in the directly corresponding autocorrelation function. Our algorithm was tested against a comparable algorithm and outperformed it by passing 122 out of 165 tests, while the existing algorithm passed 83 tests. The robustness of our method can be attributed jointly to the algorithmic approach and to design decisions taken at the implementation level.
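The idea of reading a season length off the distances between zeros of the autocorrelation function can be illustrated with a minimal sketch. It is not the paper's algorithm: the interpolation, filtering and detrending steps are omitted, all function names are hypothetical, and doubling the median zero-to-zero distance is an assumption that holds when the autocorrelation is roughly sinusoidal (its zeros then lie half a period apart).

```python
import math
from statistics import median
from typing import List, Optional


def autocorrelation(x: List[float], max_lag: int) -> List[float]:
    """Sample autocorrelation for lags 0..max_lag, normalized so lag 0 is 1."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered)
    if var == 0.0:
        raise ValueError("constant series has no defined autocorrelation")
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]


def zero_crossings(acf: List[float]) -> List[float]:
    """Lags at which the ACF is zero or changes sign (linearly interpolated)."""
    zeros = []
    for lag in range(len(acf) - 1):
        a, b = acf[lag], acf[lag + 1]
        if a == 0.0:
            zeros.append(float(lag))
        elif a * b < 0:
            zeros.append(lag + a / (a - b))
    return zeros


def estimate_season_length(x: List[float]) -> Optional[int]:
    """Assumed rule: season length is twice the median distance between zeros."""
    zeros = zero_crossings(autocorrelation(x, len(x) // 2))
    if len(zeros) < 2:
        return None  # too few zeros to measure a distance
    gaps = [b - a for a, b in zip(zeros, zeros[1:])]
    return round(2 * median(gaps))
```

For example, `estimate_season_length([math.sin(2 * math.pi * t / 20) for t in range(400)])` should come out close to the true period of 20. Using the median rather than the mean of the gaps is a design choice that makes the estimate less sensitive to a few spurious crossings in noisy data.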

[46]  arXiv:1911.06028 (cross-list from cs.LG) [pdf, other]
Title: SDGM: Sparse Bayesian Classifier Based on a Discriminative Gaussian Mixture Model
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In probabilistic classification, a discriminative model based on a Gaussian mixture exhibits flexible fitting capability. Nevertheless, it is difficult to determine the number of components. We propose a sparse classifier based on a discriminative Gaussian mixture model (GMM), which is named sparse discriminative Gaussian mixture (SDGM). In the SDGM, a GMM-based discriminative model is trained by sparse Bayesian learning. This learning algorithm improves the generalization capability by obtaining a sparse solution and automatically determines the number of components by removing redundant components. The SDGM can be embedded into neural networks (NNs) such as convolutional NNs and can be trained in an end-to-end manner. Experimental results indicated that the proposed method prevented overfitting by obtaining sparsity. Furthermore, we demonstrated that the proposed method outperformed a fully connected layer with the softmax function in certain cases when it was used as the last layer of a deep NN.

[47]  arXiv:1911.06048 (cross-list from cs.LG) [pdf, other]
Title: Conjugate Gradients for Kernel Machines
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Regularized least-squares (kernel-ridge / Gaussian process) regression is a fundamental algorithm of statistics and machine learning. Because generic algorithms for the exact solution have cubic complexity in the number of datapoints, large datasets require approximations. In this work, the computation of the least-squares prediction is itself treated as a probabilistic inference problem. We propose a structured Gaussian regression model on the kernel function that uses projections of the kernel matrix to obtain a low-rank approximation of the kernel and the matrix. A central result is an enhanced way to use the method of conjugate gradients for the specific setting of least-squares regression as encountered in machine learning. Our method improves the approximation of the kernel ridge regressor / Gaussian process posterior mean over vanilla conjugate gradients and allows computation of the posterior variance and the log marginal likelihood (evidence) without further overhead.
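For context, the vanilla conjugate-gradient baseline mentioned in this abstract can be sketched as below: it solves the kernel ridge system (K + noise * I) alpha = y iteratively instead of factorizing the matrix. This is the textbook method, not the enhanced variant the paper proposes; the RBF kernel, the `noise` and `lengthscale` parameters, and all function names are assumptions chosen for the example.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rbf_kernel_matrix(xs: Vector, lengthscale: float = 1.0) -> Matrix:
    """Gram matrix of a squared-exponential kernel on 1-D inputs (assumed kernel)."""
    return [[math.exp(-0.5 * ((a - b) / lengthscale) ** 2) for b in xs] for a in xs]


def matvec(A: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradients(A: Matrix, b: Vector, tol: float = 1e-10) -> Vector:
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual b - A x, which equals b at x = 0
    p = list(r)  # first search direction
    rs = dot(r, r)
    for _ in range(10 * n):  # generous cap; exact arithmetic needs at most n steps
        if math.sqrt(rs) < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)  # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction is conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def kernel_ridge_fit(xs: Vector, ys: Vector, noise: float = 0.1,
                     lengthscale: float = 1.0) -> Vector:
    """Return weights alpha solving (K + noise * I) alpha = ys."""
    K = rbf_kernel_matrix(xs, lengthscale)
    for i in range(len(xs)):
        K[i][i] += noise
    return conjugate_gradients(K, ys)
```

Each iteration costs one matrix-vector product, so stopping after a few iterations gives an approximate solution at quadratic rather than cubic cost, which is the regime the abstract is concerned with.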

[48]  arXiv:1911.06057 (cross-list from cs.LG) [pdf, other]
Title: Supplementary material for Uncorrected least-squares temporal difference with lambda-return
Authors: Takayuki Osogami
Comments: 9 pages, supplementary material for an AAAI-20 paper
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Here, we provide supplementary material for Takayuki Osogami, "Uncorrected least-squares temporal difference with lambda-return," which appears in {\it Proceedings of the 34th AAAI Conference on Artificial Intelligence} (AAAI-20).

[49]  arXiv:1911.06106 (cross-list from q-bio.BM) [pdf]
Title: AMP0: Species-Specific Prediction of Anti-microbial Peptides using Zero and Few Shot Learning
Comments: Under journal submission, 2019
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Machine Learning (stat.ML)

The evolution of drug-resistant microbial species is one of the major challenges to global health. The development of new antimicrobial treatments such as antimicrobial peptides needs to be accelerated to combat this threat. However, the discovery of novel antimicrobial peptides is hampered by low-throughput biochemical assays. Computational techniques can be used for rapid screening of promising antimicrobial peptide candidates prior to testing in the wet lab. The vast majority of existing antimicrobial peptide predictors are non-targeted in nature, i.e., they can predict whether a given peptide sequence is antimicrobial, but they are unable to predict whether the sequence can target a particular microbial species. In this work, we have developed a targeted antimicrobial peptide activity predictor that can predict whether a peptide is effective against a given microbial species or not. This has been made possible through zero-shot and few-shot machine learning. The proposed predictor, called AMP0, takes in the peptide amino acid sequence and any N/C-termini modifications together with the genomic sequence of a target microbial species to generate targeted predictions. It is important to note that the proposed method can generate predictions for species that are not part of its training set. The accuracy of predictions for novel test species can be further improved by providing a few example peptides for that species. Our computational cross-validation results show that the proposed scheme is particularly effective for targeted antimicrobial prediction in comparison to existing approaches and can be used for screening potential antimicrobial peptides in a targeted manner, especially for cases in which the number of training examples is small. The webserver of the method is available at this http URL

[50]  arXiv:1911.06107 (cross-list from q-bio.BM) [pdf, other]
Title: Earthmover-based manifold learning for analyzing molecular conformation spaces
Comments: 5 pages, 4 figures, 1 table
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)

In this paper, we propose a novel approach for manifold learning that combines the Earthmover's distance (EMD) with the diffusion maps method for dimensionality reduction. We demonstrate the potential benefits of this approach for learning shape spaces of proteins and other flexible macromolecules using a simulated dataset of 3-D density maps that mimic the non-uniform rotary motion of ATP synthase. Our results show that EMD-based diffusion maps require far fewer samples to recover the intrinsic geometry than the standard diffusion maps algorithm that is based on the Euclidean distance. To reduce the computational burden of calculating the EMD for all volume pairs, we employ a wavelet-based approximation to the EMD which reduces the computation of the pairwise EMDs to a computation of pairwise weighted-$\ell_1$ distances between wavelet coefficient vectors.

[51]  arXiv:1911.06111 (cross-list from cs.CL) [pdf, other]
Title: Instance-based Transfer Learning for Multilingual Deep Retrieval
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)

Perhaps the simplest type of multilingual transfer learning is instance-based transfer learning, in which data from the target language and the auxiliary languages are pooled, and a single model is learned from the pooled data. It is not immediately obvious when instance-based transfer learning will improve performance in this multilingual setting: for instance, a plausible conjecture is this kind of transfer learning would help only if the auxiliary languages were very similar to the target. Here we show that at large scale, this method is surprisingly effective, leading to positive transfer on all of 35 target languages we tested. We analyze this improvement and argue that the most natural explanation, namely direct vocabulary overlap between languages, only partially explains the performance gains: in fact, we demonstrate target-language improvement can occur after adding data from an auxiliary language with no vocabulary in common with the target. This surprising result is due to the effect of transitive vocabulary overlaps between pairs of auxiliary and target languages.

[52]  arXiv:1911.06118 (cross-list from cs.CL) [pdf, ps, other]
Title: Learning Multi-Sense Word Distributions using Approximate Kullback-Leibler Divergence
Comments: 7 pages, 4 tables
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

Learning word representations has garnered greater attention in the recent past due to its diverse text applications. Word embeddings encapsulate the syntactic and semantic regularities of sentences. Modelling word embeddings as multi-sense Gaussian mixture distributions will additionally capture the uncertainty and polysemy of words. We propose to learn the Gaussian mixture representation of words using a Kullback-Leibler (KL) divergence based objective function. The KL divergence based energy function provides a better distance metric which can effectively capture entailment and distribution similarity among the words. Because the KL divergence between Gaussian mixtures is intractable, we use an approximation of it. We perform qualitative and quantitative experiments on benchmark word similarity and entailment datasets which demonstrate the effectiveness of the proposed approach.

[53]  arXiv:1911.06129 (cross-list from cs.LG) [pdf, ps, other]
Title: A Bayesian/Information Theoretic Model of Bias Learning
Authors: Jonathan Baxter
Journal-ref: COLT 96 Proceedings of the ninth annual conference on Computational learning theory (1996) Pages 77-88
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In this paper the problem of learning appropriate bias for an environment of related tasks is examined from a Bayesian perspective. The environment of related tasks is shown to be naturally modelled by the concept of an {\em objective} prior distribution. Sampling from the objective prior corresponds to sampling different learning tasks from the environment. It is argued that for many common machine learning problems, although we don't know the true (objective) prior for the problem, we do have some idea of a set of possible priors to which the true prior belongs. It is shown that under these circumstances a learner can use Bayesian inference to learn the true prior by sampling from the objective prior. Bounds are given on the amount of information required to learn a task when it is simultaneously learnt with several other tasks. The bounds show that if the learner has little knowledge of the true prior, and the dimensionality of the true prior is small, then sampling multiple tasks is highly advantageous.

[54]  arXiv:1911.06154 (cross-list from cs.CL) [pdf, other]
Title: A Massive Collection of Cross-Lingual Web-Document Pairs
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. Small-scale efforts have been made to collect aligned document-level data on a limited set of language pairs such as English-German or on limited comparable collections such as Wikipedia. In this paper, we mine twelve snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of 54 million URL pairs from Common Crawl covering documents in 92 languages paired with English. We evaluate the quality of the dataset by measuring the quality of machine translations from models that have been trained on mined parallel sentence pairs from this aligned corpus and introduce a simple yet effective baseline for identifying these aligned documents. The objective of this dataset and paper is to foster new research in cross-lingual NLP across a variety of low, mid, and high-resource languages.

[55]  arXiv:1911.06156 (cross-list from cs.CL) [pdf, other]
Title: Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

Attention-based models have shown significant improvement over traditional algorithms in several NLP tasks. The Transformer, for instance, is an illustrative example that generates abstract representations of tokens inputted to an encoder based on their relationships to all tokens in a sequence. Recent studies have shown that although such models are capable of learning syntactic features purely by seeing examples, explicitly feeding this information to deep learning models can significantly enhance their performance. Leveraging syntactic information like part of speech (POS) may be particularly beneficial in limited training data settings for complex models such as the Transformer. We show that the syntax-infused Transformer with multiple features achieves an improvement of 0.7 BLEU when trained on the full WMT 14 English to German translation dataset and a maximum improvement of 1.99 BLEU points when trained on a fraction of the dataset. In addition, we find that the incorporation of syntax into BERT fine-tuning outperforms baseline on a number of downstream tasks from the GLUE benchmark.

[56]  arXiv:1911.06164 (cross-list from cs.LG) [pdf, ps, other]
Title: Learning Model Bias
Authors: Jonathan Baxter
Journal-ref: Advances in Neural Information Processing Systems 8, 1995, 169-175
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In this paper the problem of {\em learning} appropriate domain-specific bias is addressed. It is shown that this can be achieved by learning many related tasks from the same domain, and a theorem is given bounding the number of tasks that must be learnt. A corollary of the theorem is that if the tasks are known to possess a common {\em internal representation} or {\em preprocessing} then the number of examples required per task for good generalisation when learning $n$ tasks simultaneously scales like $O(a + \frac{b}{n})$, where $O(a)$ is a bound on the minimum number of examples required to learn a single task, and $O(a + b)$ is a bound on the number of examples required to learn each task independently. An experiment providing strong qualitative support for the theoretical results is reported.

[57]  arXiv:1911.06182 (cross-list from cs.CL) [pdf, other]
Title: MML: Maximal Multiverse Learning for Robust Fine-Tuning of Language Models
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

Recent state-of-the-art language models utilize a two-phase training procedure comprised of (i) unsupervised pre-training on unlabeled text, and (ii) fine-tuning for a specific supervised task. More recently, many studies have been focused on trying to improve these models by enhancing the pre-training phase, either via better choice of hyperparameters or by leveraging an improved formulation. However, the pre-training phase is computationally expensive and often done on private datasets. In this work, we present a method that leverages BERT's fine-tuning phase to its fullest, by applying an extensive number of parallel classifier heads, which are enforced to be orthogonal, while adaptively eliminating the weaker heads during training. Our method allows the model to converge to an optimal number of parallel classifiers, depending on the given dataset at hand.
We conduct extensive inter- and intra-dataset evaluations, showing that our method improves the robustness of BERT, sometimes leading to a +9\% gain in accuracy. These results highlight the importance of a proper fine-tuning procedure, especially for relatively small datasets. Our code is attached as supplementary material and our models will be made completely public.

[58]  arXiv:1911.06187 (cross-list from math.AP) [pdf]
Title: Concordance probability in a big data setting: application in non-life insurance
Subjects: Analysis of PDEs (math.AP); Machine Learning (stat.ML)

The concordance probability or C-index is a popular measure to capture the discriminatory ability of a regression model. In this article, the definition of this measure is adapted to the specific needs of the frequency and severity model, typically used during the technical pricing of a non-life insurance product. Due to the typical large sample size of the frequency data in particular, two different adaptations of the estimation procedure of the concordance probability are presented. Note that the latter procedures can be applied to all different versions of the concordance probability.
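
As a point of reference, the basic pairwise definition of the concordance probability can be sketched as below. This is a minimal illustration, not the adapted estimators from the article: the function name, the handling of ties (tied predictions count as one half, tied outcomes are skipped), and the toy numbers are assumptions made here.

```python
def concordance_probability(y_true, y_pred):
    """Fraction of comparable pairs whose predictions are ordered like the outcomes.

    Pairs with tied outcomes are not comparable and are skipped;
    pairs with tied predictions count as half-concordant (assumed convention).
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            d_true = y_true[i] - y_true[j]
            if d_true == 0:
                continue
            comparable += 1
            d_pred = y_pred[i] - y_pred[j]
            if d_true * d_pred > 0:
                concordant += 1.0
            elif d_pred == 0:
                concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy example: 5 of the 6 comparable pairs are concordant.
c_index = concordance_probability([1, 2, 3, 4], [0.1, 0.4, 0.35, 0.8])
print(c_index)  # 0.8333...
```

The double loop visits every pair, so the cost grows quadratically with the sample size, which is presumably what makes the direct estimate impractical for large frequency datasets.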

[59]  arXiv:1911.06190 (cross-list from eess.SP) [pdf, other]
Title: An Improved Tobit Kalman Filter with Adaptive Censoring Limits
Comments: 21 pages, 32 figures
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST)

This paper deals with the Tobit Kalman filtering (TKF) process when the measurements are correlated and censored. The case of interval censoring, i.e., the case of measurements which belong to some interval with given censoring limits, is considered. Two improvements of the standard TKF process are proposed, in order to estimate the hidden state vectors. Firstly, the exact covariance matrix of the censored measurements is calculated by taking into account the censoring limits. Secondly, the probability of a latent (normally distributed) measurement to belong in or out of the uncensored region is calculated by taking into account the Kalman residual. The designed algorithm is tested using both synthetic and real data sets. The real data set includes human skeleton joints' coordinates captured by the Microsoft Kinect II sensor. In order to cope with certain real-life situations that cause problems in human skeleton tracking, such as (self)-occlusions, closely interacting persons etc., adaptive censoring limits are used in the proposed TKF process. Experiments show that the proposed method outperforms other filtering processes in minimizing the overall Root Mean Square Error (RMSE) for synthetic and real data sets.

[60]  arXiv:1911.06191 (cross-list from cs.CL) [pdf, other]
Title: Microsoft Research Asia's Systems for WMT19
Comments: Accepted to "Fourth Conference on Machine Translation (WMT19)"
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

We, Microsoft Research Asia, made submissions to 11 language directions in the WMT19 news translation tasks. We won first place in 8 of the 11 directions and second place in the other three. Our basic systems are built on Transformer, back translation and knowledge distillation. We integrate several of our recent techniques to enhance the baseline systems: multi-agent dual learning (MADL), masked sequence-to-sequence pre-training (MASS), neural architecture optimization (NAO), and soft contextual data augmentation (SCA).

[61]  arXiv:1911.06192 (cross-list from cs.CL) [pdf, other]
Title: Multi-domain Dialogue State Tracking as Dynamic Knowledge Graph Enhanced Question Answering
Authors: Li Zhou, Kevin Small
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Multi-domain dialogue state tracking (DST) is a critical component for conversational AI systems. The domain ontology (i.e., specification of domains, slots, and values) of a conversational AI system is generally incomplete, making the capability for DST models to generalize to new slots, values, and domains during inference imperative. In this paper, we propose to model multi-domain DST as a question answering problem, referred to as Dialogue State Tracking via Question Answering (DSTQA). Within DSTQA, each turn generates a question asking for the value of a (domain, slot) pair, thus making it naturally extensible to unseen domains, slots, and values. Additionally, we use a dynamically-evolving knowledge graph to explicitly learn relationships between (domain, slot) pairs. Our model has a 5.80% and 12.21% relative improvement over the current state-of-the-art model on MultiWOZ 2.0 and MultiWOZ 2.1 datasets, respectively. Additionally, our model consistently outperforms the state-of-the-art model in domain adaptation settings.

[62]  arXiv:1911.06194 (cross-list from cs.CL) [pdf, other]
Title: Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models
Comments: 12 pages, 9 figures
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

The impressive performance of neural networks on natural language processing tasks is attributed to their ability to model complicated word and phrase interactions. Existing flat, word-level explanations of predictions hardly unveil how neural networks handle compositional semantics to reach predictions. To tackle the challenge, we study hierarchical explanation of neural network predictions. We identify non-additivity and independent importance attributions within hierarchies as two desirable properties for highlighting word and phrase interactions. We show, however, that prior efforts on hierarchical explanations, e.g. contextual decomposition, do not satisfy the desired properties mathematically. In this paper, we propose a formal way to quantify the importance of each word or phrase for hierarchical explanations. Following the formulation, we propose the Sampling and Contextual Decomposition (SCD) algorithm and the Sampling and Occlusion (SOC) algorithm. Human and metrics evaluation on both LSTM models and BERT Transformer models on multiple datasets show that our algorithms outperform prior hierarchical explanation algorithms. Our algorithms apply to hierarchical visualization of compositional semantics, extraction of classification rules, and improving human trust in models.

[63]  arXiv:1911.06197 (cross-list from cs.CL) [pdf]
Title: Towards automatic extractive text summarization of A-133 Single Audit reports with machine learning
Comments: 8 pages, first version
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

The rapid growth of text data has motivated the development of machine-learning based automatic text summarization strategies that concisely capture the essential ideas in a larger text. This study aimed to devise an extractive summarization method for A-133 Single Audits, which assess if recipients of federal grants are compliant with program requirements for use of federal funding. Currently, these voluminous audits must be manually analyzed by officials for oversight, risk management, and prioritization purposes. Automated summarization has the potential to streamline these processes. Analysis focused on the "Findings" section of ~20,000 Single Audits spanning 2016-2018. Following text preprocessing and GloVe embedding, sentence-level k-means clustering was performed to partition sentences by topic and to establish the importance of each sentence. For each audit, key summary sentences were extracted by proximity to cluster centroids. Summaries were judged by non-expert human evaluation and compared to human-generated summaries using the ROUGE metric. Though the goal was to fully automate summarization of A-133 audits, human input was required at various stages due to large variability in audit writing style, content, and context. Examples of human inputs include the number of clusters, the choice to keep or discard certain clusters based on their content relevance, and the definition of a top sentence. Overall, this approach made progress towards automated extractive summaries of A-133 audits, with future work to focus on full automation and improving summary consistency. This work highlights the inherent difficulty and subjective nature of automated summarization in a real-world application.

[64]  arXiv:1911.06204 (cross-list from cond-mat.stat-mech) [pdf, other]
Title: Estimating differential entropy using recursive copula splitting
Subjects: Statistical Mechanics (cond-mat.stat-mech); Statistics Theory (math.ST)

A method for estimating the Shannon differential entropy of multidimensional random variables using independent samples is described. The method is based on decomposing the distribution into a product of the marginal distributions and the joint dependency, also known as the copula. The entropy of marginals is estimated using one-dimensional methods. The entropy of the copula, which always has a compact support, is estimated recursively by splitting the data along statistically dependent dimensions. Numerical examples demonstrate that the method is accurate for distributions with compact and non-compact supports, which is imperative when the support is not known or of mixed type (in different dimensions). At high dimensions (larger than 20), our method is not only more accurate, but also significantly more efficient than existing approaches.
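
The decomposition the abstract relies on can be written out explicitly. The following is the standard identity for a $d$-dimensional random variable with a density; the symbols $h$, $c$, and $F_i$ are notation chosen here, not taken from the paper.

```latex
% Joint differential entropy = sum of marginal entropies + copula entropy.
% c is the copula density on the unit cube, u_i = F_i(x_i) with F_i the i-th marginal CDF.
\begin{align}
  h(X_1,\dots,X_d) &= \sum_{i=1}^{d} h(X_i) + h(c), \\
  h(c) &= -\int_{[0,1]^d} c(u)\,\log c(u)\,\mathrm{d}u \;\le\; 0 .
\end{align}
```

The copula term vanishes exactly when the components are independent, so it carries all of the dependence, and its support $[0,1]^d$ is compact regardless of the support of $X$.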

[65]  arXiv:1911.06217 (cross-list from cs.LG) [pdf, other]
Title: On Network Embedding for Machine Learning on Road Networks: A Case Study on the Danish Road Network
Comments: © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Journal-ref: 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 3422-3431
Subjects: Machine Learning (cs.LG); Databases (cs.DB); Machine Learning (stat.ML)

Road networks are a type of spatial network, where edges may be associated with qualitative information such as road type and speed limit. Unfortunately, such information is often incomplete; for instance, OpenStreetMap only has speed limits for 13% of all Danish road segments. This is problematic for analysis tasks that rely on such information for machine learning. To enable machine learning in such circumstances, one may consider the application of network embedding methods to extract structural information from the network. However, these methods have so far mostly been used in the context of social networks, which differ significantly from road networks in terms of, e.g., node degree and level of homophily (which are key to the performance of many network embedding methods). We analyze the use of network embedding methods, specifically node2vec, for learning road segment embeddings in road networks. Due to the often limited availability of information on other relevant road characteristics, the analysis focuses on leveraging the spatial network structure. Our results suggest that network embedding methods can indeed be used for deriving relevant network features (that may, e.g., be used for predicting speed limits), but that the qualities of the embeddings differ from embeddings for social networks.

[66]  arXiv:1911.06242 (cross-list from eess.SP) [pdf, other]
Title: Condition monitoring and early diagnostics methodologies for hydropower plants
Authors: Alessandro Betti (1), Emanuele Crisostomi (2), Gianluca Paolinelli (3), Antonio Piazzi (1), Fabrizio Ruffini (1), Mauro Tucci (2) ((1) i-EM S.r.l., (2) Department of Energy, Systems, Territory and Constructions Engineering, University of Pisa and (3) Pure Power Control S.r.l.)
Comments: 8 pages, 4 figures. This work has been submitted to the Elsevier Renewable Energy for possible publication
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)

Hydropower plants are one of the most convenient options for power generation, as they generate energy exploiting a renewable source, they have relatively low operating and maintenance costs, and they may be used to provide ancillary services, exploiting the large reservoirs of available water. The recent advances in Information and Communication Technologies (ICT) and in machine learning methodologies are seen as fundamental enablers to upgrade and modernize the current operation of most hydropower plants, in terms of condition monitoring, early diagnostics and eventually predictive maintenance. While very few works, or running technologies, have been documented so far for the hydro case, in this paper we propose a novel Key Performance Indicator (KPI) that we have recently developed and tested on operating hydropower plants. In particular, we show that after more than one year of operation it has been able to identify several faults, and to support the operation and maintenance tasks of plant operators. Also, we show that the proposed KPI outperforms conventional multivariable process control charts, like the Hotelling $t_2$ index.

[67]  arXiv:1911.06256 (cross-list from cs.LG) [pdf, other]
Title: A Comparative Study between Bayesian and Frequentist Neural Networks for Remaining Useful Life Estimation in Condition-Based Maintenance
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In the last decade, deep learning (DL) has outperformed model-based and statistical approaches in predicting the remaining useful life (RUL) of machinery in the context of condition-based maintenance. One of the major drawbacks of DL is that it heavily depends on a large amount of labeled data, which are typically expensive and time-consuming to obtain, especially in industrial applications. Scarce training data lead to uncertain estimates of the model's parameters, which in turn result in poor prognostic performance. Quantifying this parameter uncertainty is important in order to determine how reliable the prediction is. Traditional DL techniques such as neural networks are incapable of capturing the uncertainty in the training data, thus they are overconfident about their estimates. On the contrary, Bayesian deep learning has recently emerged as a promising solution to account for uncertainty in the training process, achieving state-of-the-art performance in many classification and regression tasks. In this work Bayesian DL techniques such as Bayesian dense neural networks and Bayesian convolutional neural networks are applied to RUL estimation and compared to their frequentist counterparts from the literature. The effectiveness of the proposed models is verified on the popular C-MAPSS dataset. Furthermore, parameter uncertainty is quantified and used to gain additional insight into the data.

[68]  arXiv:1911.06257 (cross-list from cs.LG) [pdf, other]
Title: ViWi: A Deep Learning Dataset Framework for Vision-Aided Wireless Communications
Comments: The ViWi datasets and applications are available at this https URL
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP); Machine Learning (stat.ML)

The growing role that artificial intelligence and specifically machine learning is playing in shaping the future of wireless communications has opened up many new and intriguing research directions. This paper motivates the research in the novel direction of \textit{vision-aided wireless communications}, which aims at leveraging visual sensory information in tackling wireless communication problems. Like any new research direction driven by machine learning, obtaining a development dataset poses the first and most important challenge to vision-aided wireless communications. This paper addresses this issue by introducing the Vision-Wireless (ViWi) dataset framework. It is developed to be a parametric, systematic, and scalable data generation framework. It utilizes advanced 3D-modeling and ray-tracing software to generate high-fidelity synthetic wireless and vision data samples for the same scenes. The result is a framework that not only offers a way to generate training and testing datasets but also helps provide a common ground on which the quality of different machine learning-powered solutions could be assessed.

[69]  arXiv:1911.06267 (cross-list from quant-ph) [pdf, other]
Title: A regression algorithm for accelerated lattice QCD that exploits sparse inference on the D-Wave quantum annealer
Comments: 6 pages, 4 figures
Subjects: Quantum Physics (quant-ph); High Energy Physics - Lattice (hep-lat); Machine Learning (stat.ML)

We propose a regression algorithm that utilizes a learned dictionary optimized for sparse inference on a D-Wave quantum annealer. In this regression algorithm, we concatenate the independent and dependent variables as a combined vector, and encode the high-order correlations between them into a dictionary optimized for sparse reconstruction. On a test dataset, the dependent variable is initialized to its average value and then a sparse reconstruction of the combined vector is obtained in which the dependent variable is typically shifted closer to its true value, as in a standard inpainting or denoising task. Here, a quantum annealer, which can presumably exploit a fully entangled initial state to better explore the complex energy landscape, is used to solve the highly non-convex sparse coding optimization problem. The regression algorithm is demonstrated for lattice quantum chromodynamics simulation data using a D-Wave 2000Q quantum annealer and good prediction performance is achieved. The regression test is performed using six different values for the number of fully connected logical qubits, between 20 and 64, the latter being the maximum that can be embedded on the D-Wave 2000Q. The scaling results indicate that a larger number of qubits gives better prediction accuracy, the best performance being comparable to the best classical regression algorithms reported so far.

[70]  arXiv:1911.06285 (cross-list from cs.LG) [pdf, other]
Title: DomainGAN: Generating Adversarial Examples to Attack Domain Generation Algorithm Classifiers
Comments: 8 pages, 9 figures
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Signal Processing (eess.SP); Machine Learning (stat.ML)

Domain Generation Algorithms (DGAs) are frequently used to generate large numbers of domains for use by botnets. These domains are often used as rendezvous points for the servers that malware has command and control over. There are many algorithms that are used to generate domains, but many of these algorithms are simplistic and are very easy to detect using classical machine learning techniques. In this paper, three different variants of generative adversarial networks (GANs) are used to improve domain generation by making the domains more difficult for machine learning algorithms to detect. The domains generated by traditional DGAs and the GAN-based DGA are then compared by using state-of-the-art machine learning based DGA classifiers. The results show that the GAN-based DGAs get detected by the DGA classifiers significantly less often than the traditional DGAs. An analysis of the GAN variants is also performed to show which GAN variant produces the most usable domains. As verified by testing results and analysis, the Wasserstein GAN with Gradient Penalty (WGANGP) is the best GAN variant to use as a DGA.

[71]  arXiv:1911.06286 (cross-list from math.NA) [pdf, ps, other]
Title: Importance sampling for a robust and efficient multilevel Monte Carlo estimator for stochastic reaction networks
Subjects: Numerical Analysis (math.NA); Computation (stat.CO)

The multilevel Monte Carlo (MLMC) method for continuous time Markov chains, first introduced by Anderson and Higham (2012), is a highly efficient simulation technique that can be used to estimate various statistical quantities for stochastic reaction networks (SRNs), and in particular for stochastic biological systems. Unfortunately, the robustness and performance of the multilevel method can deteriorate due to the phenomenon of high kurtosis, observed at the deep levels of MLMC, which leads to inaccurate estimates of the sample variance. In this work, we address cases where the high-kurtosis phenomenon is due to \textit{catastrophic coupling} (characteristic of pure jump processes where coupled consecutive paths are identical in most of the simulations, while differences only appear in a very small proportion), and introduce a pathwise dependent importance sampling technique that improves the robustness and efficiency of the multilevel method. Our analysis, along with the conducted numerical experiments, demonstrates that our proposed method significantly reduces the kurtosis of the deep levels of MLMC, and also improves the strong convergence rate from $\beta=1$ for the standard case (without importance sampling), to $\beta=1+\delta$, where $0<\delta<1$ is a user-selected parameter in our importance sampling algorithm. Due to the complexity theorem of MLMC and given a pre-selected tolerance, $TOL$, this results in an improvement of the complexity from $\mathcal{O}\left(TOL^{-2} \log(TOL)^2\right)$ in the standard case to $\mathcal{O}\left(TOL^{-2}\right)$.
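
For readers unfamiliar with the complexity theorem invoked above, the following is a brief sketch of its standard form. The symbols $\alpha$ (weak convergence rate), $\gamma$ (growth rate of the cost per sample), $h_\ell$ (step size at level $\ell$), $V_\ell$, $C_\ell$ and $W$ are notation assumed here and do not appear in the abstract; the value $\gamma = 1$ is also an assumption, corresponding to a per-path cost proportional to the number of time steps.

```latex
% Standard MLMC complexity theorem (sketch; notation assumed, not taken from the abstract).
% Level-l variance and cost: V_l = O(h_l^beta), C_l = O(h_l^{-gamma}),
% with weak rate alpha >= min(beta, gamma) / 2.
\[
  W(TOL) =
  \begin{cases}
    \mathcal{O}\!\left(TOL^{-2}\right), & \beta > \gamma, \\[2pt]
    \mathcal{O}\!\left(TOL^{-2}\log(TOL)^{2}\right), & \beta = \gamma, \\[2pt]
    \mathcal{O}\!\left(TOL^{-2-(\gamma-\beta)/\alpha}\right), & \beta < \gamma.
  \end{cases}
\]
% With gamma = 1 (assumed): beta = 1 falls in the middle case,
% while beta = 1 + delta > 1 falls in the first case.
```

Under that assumption, raising $\beta$ from $1$ to $1+\delta$ moves the estimator from the borderline case to the first case, which is consistent with the removal of the logarithmic factor stated in the abstract.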

[72]  arXiv:1911.06316 (cross-list from eess.SP) [pdf, other]
Title: Real-time Anomaly Detection and Classification in Streaming PMU Data
Comments: 9 pages, 12 figures
Subjects: Signal Processing (eess.SP); Machine Learning (stat.ML)

Ensuring secure and reliable operations of the power grid is a primary concern of system operators. Phasor measurement units (PMUs) are rapidly being deployed in the grid to provide fast-sampled operational data that should enable quicker decision-making. This work presents a general interpretable framework for analyzing real-time PMU data, thus enabling grid operators to understand the current state and to identify anomalies on the fly. Applying statistical learning tools to the streaming data, we first learn an effective dynamical model to describe the current behavior of the system. Next, we use the probabilistic predictions of our learned model to define, in a principled way, an efficient anomaly detection tool. Finally, the last module of our framework produces on-the-fly classification of the detected anomalies into common occurrence classes using features that grid operators are familiar with. We demonstrate the efficacy of our interpretable approach through extensive numerical experiments on real PMU data collected from a transmission operator in the USA.

[73]  arXiv:1911.06317 (cross-list from cs.LG) [pdf, other]
Title: Gradientless Descent: High-Dimensional Zeroth-Order Optimization
Comments: 11 main pages, 26 total pages
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Zeroth-order optimization is the process of minimizing an objective $f(x)$, given oracle access to evaluations at adaptively chosen inputs $x$. In this paper, we present two simple yet powerful GradientLess Descent (GLD) algorithms that do not rely on an underlying gradient estimate and are numerically stable. We analyze our algorithms from a novel geometric perspective and present a novel analysis that shows convergence within an $\epsilon$-ball of the optimum in $O(kQ\log(n)\log(R/\epsilon))$ evaluations, for {\it any monotone transform} of a smooth and strongly convex objective with latent dimension $k < n$, where the input dimension is $n$, $R$ is the diameter of the input space and $Q$ is the condition number. Our rates are the first of their kind to be both 1) poly-logarithmically dependent on dimensionality and 2) invariant under monotone transformations. We further leverage our geometric perspective to show that our analysis is optimal. Both monotone invariance and the ability to utilize a low latent dimensionality are key to the empirical success of our algorithms, as demonstrated on BBOB and MuJoCo benchmarks.
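
To make the idea of a gradient-free, comparison-only method concrete, here is a minimal Python sketch of a random search over a geometric grid of radii. It is an illustration written for this listing, not the authors' algorithm as specified in the paper: the function name `gld_search`, its parameters (`max_radius`, `min_radius`, `iterations`) and the ball-sampling scheme are all assumptions.

```python
import math
import random


def gld_search(f, x0, max_radius, min_radius, iterations, rng=None):
    """Comparison-based random search over geometrically spaced radii.

    Hypothetical sketch: only comparisons of objective values are used,
    never gradients or the magnitudes of f.
    """
    rng = rng or random.Random(0)
    x = list(x0)
    fx = f(x)
    n = len(x)
    # Number of radii in the grid max_radius, max_radius/2, ..., ~min_radius.
    num_scales = max(1, math.ceil(math.log2(max_radius / min_radius)) + 1)
    for _ in range(iterations):
        best_x, best_f = x, fx
        for k in range(num_scales):
            radius = max_radius * 2.0 ** (-k)
            # Uniform point in a ball: Gaussian direction, radius * u^(1/n).
            direction = [rng.gauss(0.0, 1.0) for _ in range(n)]
            norm = math.sqrt(sum(d * d for d in direction)) or 1.0
            scale = radius * rng.random() ** (1.0 / n) / norm
            candidate = [xi + scale * di for xi, di in zip(x, direction)]
            fc = f(candidate)
            # Keep the candidate only if it strictly improves the objective.
            if fc < best_f:
                best_x, best_f = candidate, fc
        x, fx = best_x, best_f
    return x, fx


def sphere(x):
    """Simple smooth, strongly convex test objective."""
    return sum(xi * xi for xi in x)
```

Because the update depends only on which of two objective values is smaller, replacing `sphere` by a strictly increasing transform of it leaves the sequence of iterates unchanged for a fixed random seed, which is the kind of monotone invariance the abstract refers to.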

[74]  arXiv:1911.06319 (cross-list from cs.LG) [pdf, ps, other]
Title: The Canonical Distortion Measure for Vector Quantization and Function Approximation
Authors: Jonathan Baxter
Journal-ref: In: Thrun S., Pratt L. (eds) Learning to Learn (1998). Pages 159-177
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

To measure the quality of a set of vector quantization points, a means of measuring the distance between a random point and its quantization is required. Common metrics such as the {\em Hamming} and {\em Euclidean} metrics, while mathematically simple, are inappropriate for comparing natural signals such as speech or images. In this paper it is shown how an {\em environment} of functions on an input space $X$ induces a {\em canonical distortion measure} (CDM) on $X$. The depiction "canonical" is justified because it is shown that optimizing the reconstruction error of $X$ with respect to the CDM gives rise to optimal piecewise constant approximations of the functions in the environment. The CDM is calculated in closed form for several different function classes. An algorithm for training neural networks to implement the CDM is presented along with some encouraging experimental results.

Replacements for Fri, 15 Nov 19

[75]  arXiv:1108.2883 (replaced) [pdf, ps, other]
Title: Bayesian test of normality versus a Dirichlet process mixture alternative
Comments: 24 pages, 5 figures, 1 table
Subjects: Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
[76]  arXiv:1610.10028 (replaced) [pdf, other]
Title: Refiltering hypothesis tests to control sign error
Authors: Art B. Owen
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
[77]  arXiv:1612.08288 (replaced) [pdf, ps, other]
Title: Instrumental Variable Quantile Regression with Misclassification
Authors: Takuya Ura
Subjects: Methodology (stat.ME)
[78]  arXiv:1707.09049 (replaced) [pdf, other]
Title: Variational Joint Filtering
Subjects: Machine Learning (stat.ML)
[79]  arXiv:1801.03583 (replaced) [pdf, other]
Title: Graphical Models for Processing Missing Data
Comments: 34 pages, 5 figures
Subjects: Methodology (stat.ME)
[80]  arXiv:1801.08120 (replaced) [pdf, other]
Title: Optimal Estimation of Simultaneous Signals Using Absolute Inner Product with Applications to Integrative Genomics
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
[81]  arXiv:1805.06970 (replaced) [pdf, other]
Title: Global and Simultaneous Hypothesis Testing for High-Dimensional Logistic Regression Models
Subjects: Methodology (stat.ME)
[82]  arXiv:1807.05832 (replaced) [pdf, ps, other]
Title: Manifold Adversarial Learning
Comments: 11 pages, 26 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[83]  arXiv:1809.02463 (replaced) [pdf, other]
Title: Dirichlet process mixtures under affine transformations of the data
Comments: 35 pages, 7 Figures
Subjects: Methodology (stat.ME)
[84]  arXiv:1811.08039 (replaced) [pdf, other]
Title: Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[85]  arXiv:1901.09078 (replaced) [pdf, other]
Title: Finding Archetypal Spaces Using Neural Networks
Comments: 9 pages, 10 figures, to be presented at IEEE Big Data 2019
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[86]  arXiv:1903.00904 (replaced) [pdf, other]
Title: adVAE: A self-adversarial variational autoencoder with Gaussian anomaly prior knowledge for anomaly detection
Comments: This paper has been accepted by Knowledge-based Systems
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[87]  arXiv:1903.02380 (replaced) [pdf, other]
Title: Detecting Overfitting via Adversarial Examples
Comments: 17 pages
Journal-ref: Part of: Advances in Neural Information Processing Systems 32 (NIPS 2019) pre-proceedings
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[88]  arXiv:1903.03894 (replaced) [pdf, other]
Title: GNNExplainer: Generating Explanations for Graph Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[89]  arXiv:1903.08987 (replaced) [pdf, other]
Title: Some New Copula Based Distribution-free Tests of Independence among Several Random Variables
Comments: arXiv admin note: text overlap with arXiv:1708.07485
Subjects: Statistics Theory (math.ST)
[90]  arXiv:1904.07199 (replaced) [pdf, other]
Title: Exact Rate-Distortion in Autoencoders via Echo Noise
Comments: NeurIPS 2019; updated Gaussian baseline results, added disentanglement
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
[91]  arXiv:1904.08497 (replaced) [pdf, other]
Title: An In-Depth Study on Open-Set Camera Model Identification
Comments: Published through IEEE Access journal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[92]  arXiv:1904.10921 (replaced) [pdf, other]
Title: Plug-in, Trainable Gate for Streamlining Arbitrary Neural Networks
Comments: Accepted to AAAI 2020 (Poster)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[93]  arXiv:1905.00626 (replaced) [pdf, other]
Title: On Linear Learning with Manycore Processors
Comments: To appear in: 2019 IEEE 26th International Conference on High Performance Computing (HiPC)
Subjects: Performance (cs.PF); Machine Learning (cs.LG); Machine Learning (stat.ML)
[94]  arXiv:1905.10259 (replaced) [pdf, other]
Title: Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks
Comments: 22 pages. Accepted for publication at NeurIPS 2019
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[95]  arXiv:1905.11232 (replaced) [pdf, other]
Title: Efficient posterior sampling for high-dimensional imbalanced logistic regression
Comments: 4 figures
Subjects: Methodology (stat.ME); Computation (stat.CO)
[96]  arXiv:1905.11614 (replaced) [pdf, other]
Title: Uncertainty-based Continual Learning with Adaptive Regularization
Comments: 10 pages (including Supplementary Materials), Neurips 2019 camera ready version
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[97]  arXiv:1906.00531 (replaced) [pdf, other]
Title: Model selection for contextual bandits
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
[98]  arXiv:1906.02685 (replaced) [pdf, other]
Title: Stochastic Bandits with Context Distributions
Comments: Accepted at NeurIPS 2019
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[99]  arXiv:1906.04159 (replaced) [pdf, other]
Title: Inference and Uncertainty Quantification for Noisy Matrix Completion
Comments: published at Proceedings of the National Academy of Sciences Nov 2019, 116 (46) 22931-22937
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC); Statistics Theory (math.ST)
[100]  arXiv:1906.04328 (replaced) [pdf, other]
Title: Importance Resampling for Off-policy Prediction
Comments: Recently published in NeurIPS 2019
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[101]  arXiv:1906.04834 (replaced) [pdf, other]
Title: Relaxed random walks at scale
Comments: 18 pages, 4 figures
Subjects: Populations and Evolution (q-bio.PE); Methodology (stat.ME)
[102]  arXiv:1906.06899 (replaced) [pdf, other]
Title: A Provably Correct and Robust Algorithm for Convolutive Nonnegative Matrix Factorization
Comments: 24 pages, 4 figures, references updated
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
[103]  arXiv:1908.01109 (replaced) [pdf, other]
Title: The Use of Binary Choice Forests to Model and Estimate Discrete Choices
Comments: 56 pages, 4 figures, 11 tables
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
[104]  arXiv:1908.03015 (replaced) [pdf, other]
Title: Augmenting Variational Autoencoders with Sparse Labels: A Unified Framework for Unsupervised, Semi-(un)supervised, and Supervised Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
[105]  arXiv:1908.07832 (replaced) [pdf, other]
Title: Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
[106]  arXiv:1909.05289 (replaced) [pdf, other]
Title: Deep Prediction of Investor Interest: a Supervised Clustering Approach
Subjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Machine Learning (stat.ML)
[107]  arXiv:1910.06539 (replaced) [pdf, other]
Title: Challenges in Bayesian inference via Markov chain Monte Carlo for neural networks
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
[108]  arXiv:1910.07295 (replaced) [pdf, other]
Title: Unsupervised Domain Adaptation Meets Offline Recommender Learning
Authors: Yuta Saito
Comments: accepted to the NewInML forum (co-located with NeurIPS 2019)
Subjects: Machine Learning (stat.ML); Information Retrieval (cs.IR); Machine Learning (cs.LG)
[109]  arXiv:1910.10308 (replaced) [pdf, other]
Title: Weighted Distributed Differential Privacy ERM: Convex and Non-convex
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[110]  arXiv:1910.13573 (replaced) [pdf, other]
Title: Semi-Supervised Natural Language Approach for Fine-Grained Classification of Medical Reports
Comments: Accepted for IEEE publication & presented at MIT URTC
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
[111]  arXiv:1911.00348 (replaced) [pdf, other]
Title: Hierarchical Expert Networks for Meta-Learning
Comments: arXiv admin note: text overlap with arXiv:1907.11452
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[112]  arXiv:1911.00847 (replaced) [pdf, other]
Title: Weakly Supervised Deep Learning Approach in Streaming Environments
Comments: This paper has been accepted for publication in The 2019 IEEE International Conference on Big Data (IEEE BigData 2019), Los Angeles, CA, USA
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
[113]  arXiv:1911.01731 (replaced) [pdf, other]
Title: GraphAIR: Graph Representation Learning with Neighborhood Aggregation and Interaction
Comments: 8 pages, in submission to IEEE Transactions on Knowledge and Data Engineering
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[114]  arXiv:1911.02915 (replaced) [pdf, other]
Title: A Statistically Identifiable Model for Tensor-Valued Gaussian Random Variables
Comments: 14 pages, 12 figures
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Statistics Theory (math.ST)
[115]  arXiv:1911.02966 (replaced) [pdf]
Title: An automated approach for task evaluation using EEG signals
Comments: 19 pages, 10 figures, 4 tables
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
[116]  arXiv:1911.04448 (replaced) [pdf, other]
Title: Real-Time Reinforcement Learning
Comments: NeurIPS 2019
Journal-ref: Neural Information Processing Systems (2019)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[117]  arXiv:1911.05109 (replaced) [pdf, other]
Title: Harmonic Mean Point Processes: Proportional Rate Error Minimization for Obtundation Prediction
Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[118]  arXiv:1911.05211 (replaced) [pdf, other]
Title: AMPL: A Data-Driven Modeling Pipeline for Drug Discovery
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
[119]  arXiv:1911.05309 (replaced) [pdf, other]
Title: Adaptive Portfolio by Solving Multi-armed Bandit via Thompson Sampling
Comments: conference
Subjects: Machine Learning (cs.LG); Portfolio Management (q-fin.PM); Machine Learning (stat.ML)
[120]  arXiv:1911.05485 (replaced) [pdf, ps, other]
Title: Diffusion Improves Graph Learning
Comments: Published as a conference paper at NeurIPS 2019
Journal-ref: Thirty-third Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
[121]  arXiv:1911.05684 (replaced) [pdf, other]
Title: A Simulation-free Group Sequential Design with Max-combo Tests in the Presence of Non-proportional Hazards
Authors: Lili Wang (1), Xiaodong Luo (2), Cheng Zheng (2) ((1) Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, U.S.A. (2) Department of Biostatistics and Programming, Research and Development, Sanofi US, Bridgewater, New Jersey, U.S.A.)
Subjects: Methodology (stat.ME)