Statistics
New submissions
New submissions for Fri, 15 Nov 19
 [1] arXiv:1911.05754 [pdf, other]

Title: Implicit Hamiltonian Monte Carlo for Sampling Multiscale Distributions
Subjects: Computation (stat.CO); Methodology (stat.ME)
Hamiltonian Monte Carlo (HMC) has been widely adopted in the statistics community because of its ability to sample high-dimensional distributions much more efficiently than other Metropolis-based methods. Despite this, HMC often performs suboptimally on distributions with high correlations or marginal variances on multiple scales because the resulting stiffness forces the leapfrog integrator in HMC to take an unreasonably small stepsize. We provide intuition as well as a formal analysis showing how these multiscale distributions limit the stepsize of leapfrog and we show how the implicit midpoint method can be used, together with Newton-Krylov iteration, to circumvent this limitation and achieve major efficiency gains. Furthermore, we offer practical guidelines for when to choose between implicit midpoint and leapfrog and what stepsize to use for each method, depending on the distribution being sampled. Unlike previous modifications to HMC, our method is generally applicable to highly non-Gaussian distributions exhibiting multiple scales. We illustrate how our method can provide a dramatic speedup over leapfrog in the context of the No-U-Turn sampler (NUTS) applied to several examples.
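The stepsize limitation described in this abstract can be illustrated on the simplest stiff case, a one-dimensional Gaussian target with Hamiltonian H(q, p) = p^2/2 + omega^2 q^2/2. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and because this toy system is linear, the implicit midpoint update is solved in closed form instead of with the Newton-Krylov iteration the abstract mentions for general targets. For this system leapfrog is stable only when h * omega < 2, whereas implicit midpoint conserves the quadratic energy for any stepsize h.

```python
def energy(q, p, omega):
    """Hamiltonian of a 1-D Gaussian target with precision omega**2."""
    return 0.5 * p * p + 0.5 * omega ** 2 * q * q


def leapfrog_step(q, p, h, omega):
    """One leapfrog (Stormer-Verlet) step; grad U(q) = omega**2 * q."""
    k = omega ** 2
    p -= 0.5 * h * k * q   # half step in momentum
    q += h * p             # full step in position
    p -= 0.5 * h * k * q   # half step in momentum
    return q, p


def implicit_midpoint_step(q, p, h, omega):
    """One implicit midpoint step.

    Assumption: the system is linear, so the implicit equations
        q' = q + h * (p + p') / 2
        p' = p - h * omega**2 * (q + q') / 2
    are solved exactly as a 2x2 linear system (no Newton iteration).
    """
    c = 0.25 * h * h * omega ** 2
    q_new = ((1.0 - c) * q + h * p) / (1.0 + c)
    p_new = ((1.0 - c) * p - h * omega ** 2 * q) / (1.0 + c)
    return q_new, p_new


def integrate(step, q, p, h, omega, n_steps):
    """Apply `step` repeatedly and return the final state."""
    for _ in range(n_steps):
        q, p = step(q, p, h, omega)
    return q, p


# Stiff example: h * omega = 5, well beyond the leapfrog stability limit of 2.
omega, h, n_steps = 100.0, 0.05, 50
e0 = energy(1.0, 0.0, omega)
e_leapfrog = energy(*integrate(leapfrog_step, 1.0, 0.0, h, omega, n_steps), omega)
e_midpoint = energy(*integrate(implicit_midpoint_step, 1.0, 0.0, h, omega, n_steps), omega)
print("relative energy growth, leapfrog:", e_leapfrog / e0)
print("relative energy error, implicit midpoint:", abs(e_midpoint - e0) / e0)
```

In an HMC sampler the energy error drives the Metropolis acceptance probability, so the blow-up in the leapfrog run corresponds to proposals that would essentially always be rejected at this stepsize.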
 [2] arXiv:1911.05770 [pdf, other]

Title: Constrained Bayesian ICA for Brain Connectome Inference
Subjects: Applications (stat.AP); Neurons and Cognition (q-bio.NC)
Brain connectomics is a developing field in neurosciences which strives to understand cognitive processes and psychiatric diseases through the analysis of interactions between brain regions. However, in the high-dimensional, low-sample, and noisy regimes that typically characterize fMRI data, the recovery of such interactions remains an ongoing challenge: how can we discover patterns of co-activity between brain regions that could then be associated with cognitive processes or psychiatric disorders? In this paper, we investigate a constrained Bayesian ICA approach which, in comparison to current methods, simultaneously allows (a) the flexible integration of multiple sources of information (fMRI, DWI, anatomical, etc.), (b) an automatic and parameter-free selection of the appropriate sparsity level and number of connected submodules and (c) the provision of estimates on the uncertainty of the recovered interactions. Our experiments, both on synthetic and real-life data, validate the flexibility of our method and highlight the benefits of integrating anatomical information for connectome inference.
 [3] arXiv:1911.05822 [pdf, ps, other]

Title: A Model of Double Descent for High-dimensional Binary Linear Classification
Comments: Short version submitted to ICASSP 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
We consider a model for logistic regression where only a subset of features of size $p$ is used for training a linear classifier over $n$ training samples. The classifier is obtained by running gradient descent (GD) on the logistic loss. For this model, we investigate the dependence of the generalization error on the overparameterization ratio $\kappa=p/n$. First, building on known deterministic results on convergence properties of the GD, we uncover a phase-transition phenomenon for the case of Gaussian regressors: the generalization error of GD is the same as that of the maximum-likelihood (ML) solution when $\kappa<\kappa_\star$, and that of the max-margin (SVM) solution when $\kappa>\kappa_\star$. Next, using the convex Gaussian min-max theorem (CGMT), we sharply characterize the performance of both the ML and SVM solutions. Combining these results, we obtain curves that explicitly characterize the generalization error of GD for varying values of $\kappa$. The numerical results validate the theoretical predictions and unveil double-descent phenomena that complement similar recent observations in linear regression settings.
 [4] arXiv:1911.05865 [pdf, other]

Title: Kriging: Beyond Matérn
Subjects: Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
The Matérn covariance function is a popular choice for prediction in spatial statistics and uncertainty quantification literature. A key benefit of the Matérn class is that it is possible to get precise control over the degree of differentiability of the process realizations. However, the Matérn class possesses exponentially decaying tails, and thus may not be suitable for modeling long range dependence. This problem can be remedied using polynomial covariances; however one loses control over the degree of differentiability of the process realizations, in that the realizations using polynomial covariances are either infinitely differentiable or not differentiable at all. We construct a new family of covariance functions using a scale mixture representation of the Matérn class where one obtains the benefits of both Matérn and polynomial covariances. The resultant covariance contains two parameters: one controls the degree of differentiability near the origin and the other controls the tail heaviness, independently of each other. Using a spectral representation, we derive theoretical properties of this new covariance including equivalence measures and asymptotic behavior of the maximum likelihood estimators under infill asymptotics. The improved theoretical properties in predictive performance of this new covariance class are verified via extensive simulations. Application using NASA's Orbiting Carbon Observatory-2 satellite data confirms the advantage of this new covariance class over the Matérn class, especially in extrapolative settings.
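The contrast this abstract draws between exponentially decaying and polynomially decaying tails can be made concrete with the standard closed forms of the Matérn correlation at half-integer smoothness. The sketch below is only an illustration: the function names are hypothetical, the Cauchy-type correlation 1 / (1 + (r/l)^2) is used as a generic stand-in for a polynomial-tailed covariance, and the new covariance family proposed in the paper is not implemented here.

```python
import math


def matern(r, nu, length_scale=1.0):
    """Matern correlation at distance r for nu in {0.5, 1.5, 2.5}.

    These are the standard closed forms; general nu needs a modified
    Bessel function and is left out of this sketch.
    """
    s = abs(r) / length_scale
    if nu == 0.5:
        return math.exp(-s)
    if nu == 1.5:
        t = math.sqrt(3.0) * s
        return (1.0 + t) * math.exp(-t)
    if nu == 2.5:
        t = math.sqrt(5.0) * s
        return (1.0 + t + t * t / 3.0) * math.exp(-t)
    raise ValueError("only nu = 0.5, 1.5, 2.5 are supported in this sketch")


def cauchy(r, length_scale=1.0):
    """Cauchy-type correlation with a polynomial (heavy) tail.

    Assumption: used purely as an example of polynomial decay.
    """
    s = abs(r) / length_scale
    return 1.0 / (1.0 + s * s)


# Compare tail behaviour as the distance grows.
for r in (1.0, 5.0, 20.0):
    print(r, matern(r, 0.5), matern(r, 2.5), cauchy(r))
```

At large distances every Matérn value is many orders of magnitude below the Cauchy-type value, which is the long-range-dependence gap the abstract refers to.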
 [5] arXiv:1911.05881 [pdf, other]

Title: Projecting Flood-Inducing Precipitation with a Bayesian Analogue Model
Subjects: Applications (stat.AP)
The hazard of pluvial flooding is largely influenced by the spatial and temporal dependence characteristics of precipitation. When extreme precipitation possesses strong spatial dependence, the risk of flooding is amplified due to catchment factors that cause runoff accumulation such as topography. Temporal dependence can also increase flood risk as storm water drainage systems operating at capacity can be overwhelmed by heavy precipitation occurring over multiple days. While transformed Gaussian processes are common choices for modeling precipitation, their weak tail dependence may lead to underestimation of flood risk. Extreme value models such as the generalized Pareto processes for threshold exceedances and max-stable models are attractive alternatives, but are difficult to fit when the number of observation sites is large, and are of little use for modeling the bulk of the distribution, which may also be of interest to water management planners. While the atmospheric dynamics governing precipitation are complex and difficult to fully incorporate into a parsimonious statistical model, non-mechanistic analogue methods that approximate those dynamics have proven to be promising approaches to capturing the temporal dependence of precipitation. In this paper, we present a Bayesian analogue method that leverages large, synoptic-scale atmospheric patterns to make precipitation forecasts. Changing spatial dependence across varying intensities is modeled as a mixture of spatial Student-t processes that can accommodate both strong and weak tail dependence. The proposed model demonstrates improved performance at capturing the distribution of extreme precipitation over Community Atmosphere Model (CAM) 5.2 forecasts.
 [6] arXiv:1911.05934 [pdf, other]

Title: Bayesian Optimization with Uncertain Preferences over Attributes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
We consider black-box global optimization of time-consuming-to-evaluate functions on behalf of a decision-maker whose preferences must be learned. Each feasible design is associated with a time-consuming-to-evaluate vector of attributes, each vector of attributes is assigned a utility by the decision-maker's utility function, and this utility function may be learned approximately using preferences expressed by the decision-maker over pairs of attribute vectors. Past work has used this estimated utility function as if it were error-free within single-objective optimization. However, errors in utility estimation may yield a poor suggested decision. Furthermore, this approach produces a single suggested "best" design, whereas decision-makers often prefer to choose among a menu of designs. We propose a novel Bayesian optimization algorithm that acknowledges the uncertainty in preference estimation and implicitly chooses designs to evaluate using the time-consuming function that are good not just for a single estimated utility function but a range of likely utility functions. Our algorithm then shows a menu of designs and evaluated attributes to the decision-maker who makes a final selection. We demonstrate the value of our algorithm in a variety of numerical experiments.
 [7] arXiv:1911.05940 [pdf, other]

Title: Distributional Clustering: A distribution-preserving clustering method
Comments: Submitted to Statistica Sinica
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
One key use of k-means clustering is to identify cluster prototypes which can serve as representative points for a dataset. However, a drawback of using k-means cluster centers as representative points is that such points distort the distribution of the underlying data. This can be highly disadvantageous in problems where the representative points are subsequently used to gain insights on the data distribution, as these points do not mimic the distribution of the data. To this end, we propose a new clustering method called "distributional clustering", which ensures cluster centers capture the distribution of the underlying data. We first prove the asymptotic convergence of the proposed cluster centers to the data generating distribution, then present an efficient algorithm for computing these cluster centers in practice. Finally, we demonstrate the effectiveness of distributional clustering on synthetic and real datasets.
 [8] arXiv:1911.05970 [pdf, other]

Title: Empirical Bayes mean estimation with nonparametric errors via order statistic regression
Subjects: Methodology (stat.ME)
We study empirical Bayes estimation of the effect sizes of $N$ units from $K$ noisy observations on each unit. We show that it is possible to achieve near-Bayes optimal mean squared error, without any assumptions or knowledge about the effect size distribution or the noise. The noise distribution can be heteroskedastic and vary arbitrarily from unit to unit. Our proposal, which we call Aurora, leverages the replication inherent in the $K$ observations per unit and recasts the effect size estimation problem as a general regression problem. Aurora with linear regression provably matches the performance of a wide array of estimators including the sample mean, the trimmed mean, the sample median, as well as James-Stein shrunk versions thereof. Aurora automates effect size estimation for Internet-scale datasets, as we demonstrate on Google data.
 [9] arXiv:1911.06006 [pdf, other]

Title: An Invariant Test for Equality of Two Large Scale Covariance Matrices
Subjects: Statistics Theory (math.ST)
In this work, we are motivated by the recent work of Zhang et al. (2019) and study a new invariant test for equality of two large scale covariance matrices. Two modified likelihood ratio tests (LRTs) by Zhang et al. (2019) are based on the sum of the logarithms of the eigenvalues (or of 1 minus the eigenvalues) of the Beta-matrix. However, as the dimension increases, many eigenvalues of the Beta-matrix are close to 0 or 1 and the modified LRTs are greatly influenced by them. In this work, instead, we consider the simple sum of the eigenvalues (of the Beta-matrix) and establish its asymptotic normality when all $n_1, n_2, p$ increase at the same rate. We numerically show that our test has higher power than the two modified likelihood ratio tests by Zhang et al. (2019) in all cases both we and they consider.
 [10] arXiv:1911.06030 [pdf]

Title: Guidelines for estimating causal effects in pragmatic randomized trials
Subjects: Methodology (stat.ME)
Pragmatic randomized trials are designed to provide evidence for clinical decision-making rather than regulatory approval. Common features of these trials include the inclusion of heterogeneous or diverse patient populations in a wide range of care settings, the use of active treatment strategies as comparators, unblinded treatment assignment, and the study of long-term, clinically relevant outcomes. These features can greatly increase the usefulness of the trial results for patients, clinicians, and other stakeholders. However, these features also introduce an increased risk of non-adherence, which reduces the value of the intention-to-treat effect as a patient-centered measure of causal effect. In these settings, the per-protocol effect provides useful complementary information for decision making. Unfortunately, there is little guidance for valid estimation of the per-protocol effect. Here, we present our full guidelines for analyses of pragmatic trials that will result in more informative causal inferences for both the intention-to-treat effect and the per-protocol effect.
 [11] arXiv:1911.06177 [pdf, other]

Title: Uncertainty Quantification in Ensembles of Honest Regression Trees using Generalized Fiducial Inference
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
Due to their accuracy, methods based on ensembles of regression trees are a popular approach for making predictions. Some common examples include Bayesian additive regression trees, boosting and random forests. This paper focuses on honest random forests, which add honesty to the original form of random forests and are proved to have better statistical properties. The main contribution is a new method that quantifies the uncertainties of the estimates and predictions produced by honest random forests. The proposed method is based on the generalized fiducial methodology, and provides a fiducial density function that measures how likely each single honest tree is to be the true model. With such a density function, estimates and predictions, as well as their confidence/prediction intervals, can be obtained. The promising empirical properties of the proposed method are demonstrated by numerical comparisons with several state-of-the-art methods, and by applications to a few real data sets. Lastly, the proposed method is theoretically backed up by a strong asymptotic guarantee.
 [12] arXiv:1911.06213 [pdf, other]

Title: Analysis of the fiber laydown quality in spunbond processes with simulation experiments evaluated by blocked neural networks
Comments: 12 pages, 23 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We present a simulation framework for spunbond processes and use a design of experiments to investigate the cause-and-effect relations of process and material parameters on the fiber laydown on a conveyor belt. The virtual experiments are analyzed by a blocked neural network. This forms the basis for the prediction of the fiber laydown characteristics and enables a quick ranking of the significance of the influencing effects. We conclude our research with an analysis of the nonlinear cause-and-effect relations.
 [13] arXiv:1911.06215 [pdf, other]

Title: Sparse Density Estimation with Measurement Errors
Comments: 32 pages, 4 figures
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG)
This paper aims to build an estimate of an unknown density of data with measurement error as a linear combination of functions from a dictionary. Inspired by the penalization approach, we propose the weighted Elastic-net penalized minimal $L_2$-distance method for sparse coefficient estimation, where the weights come adaptively from sharp concentration inequalities. The optimal weighted tuning parameters are obtained from the first-order conditions holding with high probability. Under local coherence or minimal eigenvalue assumptions, non-asymptotic oracle inequalities are derived. These theoretical results are transposed to obtain support recovery with high probability. Then, the issue of calibrating these procedures is studied through numerical experiments for discrete and continuous distributions, which show the significant improvement obtained by our procedure when compared with other conventional approaches. Finally, the method is applied to a meteorology data set, where it detects the shape of a multi-mode density better than other conventional approaches.
 [14] arXiv:1911.06225 [pdf, other]

Title: Location estimation for symmetric log-concave densities
Authors: Nilanjana Laha
Subjects: Statistics Theory (math.ST)
We revisit the problem of estimating the center of symmetry $\theta$ of an unknown symmetric density $f$. Although Stone (1975), Van Eden (1970), and Sacks (1975) constructed adaptive estimators of $\theta$ in this model, their estimators depend on tuning parameters. In an effort to circumvent the dependence on tuning parameters, we impose an additional assumption of log-concavity on $f$. We show that in this shape-restricted model, the maximum likelihood estimator (MLE) of $\theta$ exists. We also study some truncated one-step estimators and show that they are $\sqrt{n}$-consistent, and nearly achieve the asymptotic efficiency bound. We also show that the rate of convergence for the MLE is $O_p(n^{-2/5})$. Furthermore, we show that our estimators are robust with respect to the violation of the log-concavity assumption. In fact, we show that the one-step estimators are still $\sqrt{n}$-consistent under some mild conditions. These analytical conclusions are supported by simulation studies.
 [15] arXiv:1911.06239 [pdf, other]

Title: Unreliable Multi-Armed Bandits: A Novel Approach to Recommendation Systems
Comments: 4 pages, 4 figures, Aditya Narayan Ravi and Pranav Poduval have equal contribution
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
We use a novel modification of Multi-Armed Bandits to create a new model for recommendation systems. We model the recommendation system as a bandit seeking to maximize reward by pulling on arms with unknown rewards. The catch, however, is that this bandit can only access these arms through an unreliable intermediate that has some level of autonomy while choosing its arms. For example, in a streaming website the user has a lot of autonomy while choosing content they want to watch. The streaming sites can use targeted advertising as a means to bias opinions of these users. Here the streaming site is the bandit aiming to maximize reward and the user is the unreliable intermediate. We model the intermediate as accessing states via a Markov chain. The bandit is allowed to perturb this Markov chain. We prove fundamental theorems for this setting after which we show a close-to-optimal Explore-Commit algorithm.
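For readers unfamiliar with the Explore-Commit idea named at the end of this abstract, the sketch below shows the classic explore-then-commit strategy for an ordinary Bernoulli bandit. It is an illustration under stated assumptions and not the paper's algorithm: the unreliable intermediate and its Markov chain are not modeled, and the function name, arm means, and pull counts are all hypothetical.

```python
import random


def explore_then_commit(arm_means, pulls_per_arm, horizon, rng):
    """Classic explore-then-commit on Bernoulli arms.

    Pull every arm `pulls_per_arm` times, then commit to the arm with the
    highest empirical reward for the remaining rounds.
    Returns (committed_arm, total_reward).
    """
    n_arms = len(arm_means)
    if horizon < n_arms * pulls_per_arm:
        raise ValueError("horizon is too short for the exploration phase")

    def pull(arm):
        # Bernoulli reward with the arm's (unknown to the learner) mean.
        return 1.0 if rng.random() < arm_means[arm] else 0.0

    totals = [0.0] * n_arms
    total_reward = 0.0

    # Exploration phase: sample each arm equally often.
    for arm in range(n_arms):
        for _ in range(pulls_per_arm):
            reward = pull(arm)
            totals[arm] += reward
            total_reward += reward

    # Equal pull counts, so comparing totals is comparing empirical means.
    committed_arm = max(range(n_arms), key=lambda a: totals[a])

    # Commit phase: play the chosen arm for all remaining rounds.
    for _ in range(horizon - n_arms * pulls_per_arm):
        total_reward += pull(committed_arm)

    return committed_arm, total_reward


rng = random.Random(0)  # fixed seed so the run is reproducible
committed_arm, reward = explore_then_commit([0.1, 0.9], 200, 2000, rng)
print("committed to arm", committed_arm, "with total reward", reward)
```

The length of the exploration phase trades off the reward lost while sampling poor arms against the risk of committing to the wrong arm.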
 [16] arXiv:1911.06253 [pdf, ps, other]

Title: Understanding Graph Neural Networks with Asymmetric Geometric Scattering Transforms
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The scattering transform is a multilayered wavelet-based deep learning architecture that acts as a model of convolutional neural networks. Recently, several works have introduced generalizations of the scattering transform for non-Euclidean settings such as graphs. Our work builds upon these constructions by introducing windowed and non-windowed graph scattering transforms based upon a very general class of asymmetric wavelets. We show that these asymmetric graph scattering transforms have many of the same theoretical guarantees as their symmetric counterparts. This work helps bridge the gap between scattering and other graph neural networks by introducing a large family of networks with provable stability and invariance guarantees. This lays the groundwork for future deep learning architectures for graph-structured data that have learned filters and also provably have desirable theoretical properties.
 [17] arXiv:1911.06287 [pdf, other]

Title: Scalable Exact Inference in Multi-Output Gaussian Processes
Authors: Wessel P. Bruinsma, Eric Perim, Will Tebbutt, J. Scott Hosking, Arno Solin, Richard E. Turner
Comments: 19 pages, 9 figures, includes appendix
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Multi-output Gaussian processes (MOGPs) leverage the flexibility and interpretability of GPs while capturing structure across outputs, which is desirable, for example, in spatio-temporal modelling. The key problem with MOGPs is the cubic computational scaling in the number of both inputs (e.g., time points or locations), n, and outputs, p. Current methods reduce this to O(n^3 m^3), where m < p is the desired degrees of freedom. This computational cost, however, is still prohibitive in many applications. To address this limitation, we present the Orthogonal Linear Mixing Model (OLMM), an MOGP in which exact inference scales linearly in m: O(n^3 m). This advance opens up a wide range of real-world tasks and can be combined with existing GP approximations in a plug-and-play way as demonstrated in the paper. Additionally, the paper organises the existing disparate literature on MOGP models into a simple taxonomy called the Mixing Model Hierarchy (MMH).
 [18] arXiv:1911.06302 [pdf, other]

Title: rFIA: An R package for space-time estimation of forest attributes with the Forest Inventory and Analysis Database
Subjects: Applications (stat.AP)
rFIA is an R package designed to simplify the estimation of forest attributes using the USDA Forest Service Forest Inventory and Analysis (FIA) Database. Specifically, rFIA improves accessibility to the spatio-temporal estimation capacity of the FIA Database via space-time indexed summaries of forest variables within user-defined population boundaries. Direct integration with other popular R packages (e.g., dplyr, sf, and parallel) facilitates efficient space-time query and data summary, and supports common data representations and application programming interface (API). The package implements design-based estimation procedures used by the FIA Program, and has been validated against official estimates and sampling errors produced by the FIA Program. We demonstrate the utility of rFIA by assessing changes in abundance and mortality rates of ash populations in the lower peninsula of Michigan following the establishment of emerald ash borer.
Cross-lists for Fri, 15 Nov 19
 [19] arXiv:1911.05774 (cross-list from cs.LG) [pdf, ps, other]

Title: Factor Group-Sparse Regularization for Efficient Low-Rank Matrix Recovery
Comments: Accepted by NeurIPS 2019
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper develops a new class of nonconvex regularizers for low-rank matrix recovery. Many regularizers are motivated as convex relaxations of the matrix rank function. Our new factor group-sparse regularizers are motivated as a relaxation of the number of nonzero columns in a factorization of the matrix. These nonconvex regularizers are sharper than the nuclear norm; indeed, we show they are related to Schatten-$p$ norms with arbitrarily small $0 < p \leq 1$. Moreover, these factor group-sparse regularizers can be written in a factored form that enables efficient and effective nonconvex optimization; notably, the method does not use singular value decomposition. We provide generalization error bounds for low-rank matrix completion which show improved upper bounds for Schatten-$p$ norm regularization as $p$ decreases. Compared to the max norm and the factored formulation of the nuclear norm, factor group-sparse regularizers are more efficient, accurate, and robust to the initial guess of rank. Experiments show promising performance of factor group-sparse regularization for low-rank matrix completion and robust principal component analysis.
 [20] arXiv:1911.05781 (cross-list from cs.LG) [pdf, ps, other]

Title: Learning internal representations
Authors: Jonathan Baxter
Journal-ref: COLT '95 Proceedings of the eighth annual conference on Computational learning theory (1995) 311-320
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Probably the most important problem in machine learning is the preliminary biasing of a learner's hypothesis space so that it is small enough to ensure good generalisation from reasonable training sets, yet large enough that it contains a good solution to the problem being learnt. In this paper a mechanism for {\em automatically} learning or biasing the learner's hypothesis space is introduced. It works by first learning an appropriate {\em internal representation} for a learning environment and then using that representation to bias the learner's hypothesis space for the learning of future tasks drawn from the same environment.
An internal representation must be learnt by sampling from {\em many similar tasks}, not just a single task as occurs in ordinary machine learning. It is proved that the number of examples $m$ {\em per task} required to ensure good generalisation from a representation learner obeys $m = O(a+b/n)$ where $n$ is the number of tasks being learnt and $a$ and $b$ are constants. If the tasks are learnt independently ({\em i.e.} without a common representation) then $m=O(a+b)$. It is argued that for learning environments such as speech and character recognition $b\gg a$ and hence representation learning in these environments can potentially yield a drastic reduction in the number of examples required per task. It is also proved that if $n = O(b)$ (with $m=O(a+b/n)$) then the representation learnt will be good for learning novel tasks from the same environment, and that the number of examples required to generalise well on a novel task will be reduced to $O(a)$ (as opposed to $O(a+b)$ if no representation is used).
It is shown that gradient descent can be used to train neural network representations and experiment results are reported providing strong qualitative support for the theoretical results.
 [21] arXiv:1911.05806 (cross-list from cs.LG) [pdf, other]

Title: Coarse-Refinement Dilemma: On Generalization Bounds for Data Clustering
Comments: 52 pages (in which 5 pages contain references, 1 contains notation, 1 contains dictionary of terms, 2 contain proofs, 5 contain dataset images and 7 contain results)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The Data Clustering (DC) problem is of central importance for the area of Machine Learning (ML), given its usefulness to represent data structural similarities from input spaces. Unlike Supervised Machine Learning (SML), which relies on the theoretical frameworks of the Statistical Learning Theory (SLT) and the Algorithm Stability (AS), DC has scarce literature on general-purpose learning guarantees, affecting conclusive remarks on how those algorithms should be designed as well as on the validity of their results. In this context, this manuscript introduces a new concept, based on multidimensional persistent homology, to analyze the conditions under which a clustering model is capable of generalizing data. As a first step, we propose a more general definition of the DC problem by relying on Topological Spaces, instead of metric ones as typically approached in the literature. From that, we show that the DC problem presents an analogous dilemma to the Bias-Variance one, which is here referred to as the Coarse-Refinement (CR) dilemma. CR is intended to clarify the contrast between: (i) highly-refined partitions and the clustering instability (overfitting); and (ii) over-coarse partitions and the lack of representativeness (underfitting); consequently, the CR dilemma suggests the need for a relaxation of Kleinberg's richness axiom. Experimental results were used to illustrate that multidimensional persistent homology supports the measurement of divergences among DC models, leading to a consistency criterion.
 [22] arXiv:1911.05811 (cross-list from cs.LG) [pdf, other]

Title: Triply Robust Off-Policy Evaluation
Comments: Preliminary Work
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose a robust regression approach to off-policy evaluation (OPE) for contextual bandits. We frame OPE as a covariate-shift problem and leverage modern robust regression tools. Ours is a general approach that can be used to augment any existing OPE method that utilizes the direct method. When augmenting doubly robust methods, we call the resulting method Triply Robust. We prove upper bounds on the resulting bias and variance, as well as derive novel minimax bounds based on robust minimax analysis for covariate shift. Our robust regression method is compatible with deep learning, and is thus applicable to complex OPE settings that require powerful function approximators. Finally, we demonstrate superior empirical performance across the standard OPE benchmarks, especially in the case where the logging policy is unknown and must be estimated from data.
 [23] arXiv:1911.05815 (cross-list from cs.LG) [pdf, other]

Title: Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We present an algorithm, HOMER, for exploration and reinforcement learning in rich observation environments that are summarizable by an unknown latent state space. The algorithm interleaves representation learning to identify a new notion of kinematic state abstraction with strategic exploration to reach new states using the learned abstraction. The algorithm provably explores the environment with sample complexity scaling polynomially in the number of latent states and the time horizon, and, crucially, with no dependence on the size of the observation space, which could be infinitely large. This exploration guarantee further enables sample-efficient global policy optimization for any reward function. On the computational side, we show that the algorithm can be implemented efficiently whenever certain supervised learning problems are tractable. Empirically, we evaluate HOMER on a challenging exploration problem, where we show that the algorithm is exponentially more sample efficient than standard reinforcement learning baselines.
 [24] arXiv:1911.05843 (cross-list from cs.LG) [pdf, other]

Title: TASTE: Temporal and Static Tensor Factorization for Phenotyping Electronic Health Records
Authors: Ardavan Afshar, Ioakeim Perros, Haesun Park, Christopher deFilippi, Xiaowei Yan, Walter Stewart, Joyce Ho, Jimeng Sun
Comments: 19 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Phenotyping electronic health records (EHR) focuses on defining meaningful patient groups (e.g., heart failure group and diabetes group) and identifying the temporal evolution of patients in those groups. Tensor factorization has been an effective tool for phenotyping. Most of the existing works assume either a static patient representation with aggregate data or only model temporal data. However, real EHR data contain both temporal (e.g., longitudinal clinical visits) and static information (e.g., patient demographics), which are difficult to model simultaneously. In this paper, we propose Temporal And Static TEnsor factorization (TASTE) that jointly models both static and temporal information to extract phenotypes. TASTE combines the PARAFAC2 model with nonnegative matrix factorization to model a temporal and a static tensor. To fit the proposed model, we transform the original problem into simpler ones which are optimally solved in an alternating fashion. For each of the subproblems, our proposed mathematical reformulations lead to efficient subproblem solvers. Comprehensive experiments on large EHR data from a heart failure (HF) study confirmed that TASTE is up to 14x faster than several baselines and the resulting phenotypes were confirmed to be clinically meaningful by a cardiologist. Using 80 phenotypes extracted by TASTE, a simple logistic regression can achieve the same level of area under the curve (AUC) for HF prediction compared to a deep learning model using recurrent neural networks (RNN) with 345 features.
 [25] arXiv:1911.05861 (cross-list from cs.LG) [pdf, other]

Title: Federated and Differentially Private Learning for Electronic Health Records
Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The use of collaborative and decentralized machine learning techniques such as federated learning has the potential to enable the development and deployment of clinical risk prediction models in low-resource settings without requiring sensitive data to be shared or stored in a central repository. This process necessitates communication of model weights or updates between collaborating entities, but it is unclear to what extent patient privacy is compromised as a result. To gain insight into this question, we study the efficacy of centralized versus federated learning in both private and non-private settings. The clinical prediction tasks we consider are the prediction of prolonged length of stay and in-hospital mortality across thirty-one hospitals in the eICU Collaborative Research Database. We find that while it is straightforward to apply differentially private stochastic gradient descent to achieve strong privacy bounds when training in a centralized setting, it is considerably more difficult to do so in the federated setting.
 [26] arXiv:1911.05873 (cross-list from cs.LG) [pdf, ps, other]

Title: A Reduction from Reinforcement Learning to No-Regret Online Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which "any" online algorithm with sublinear regret can generate policies with provable performance guarantees. This new perspective decouples the RL problem into two parts: regret minimization and function approximation. The first part admits a standard online-learning analysis, and the second part can be quantified independently of the learning algorithm. Therefore, the proposed reduction can be used as a tool to systematically design new RL algorithms. We demonstrate this idea by devising a simple RL algorithm based on mirror descent and the generative-model oracle. For any $\gamma$-discounted tabular RL problem, with probability at least $1-\delta$, it learns an $\epsilon$-optimal policy using at most $\tilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\log(\frac{1}{\delta})}{(1-\gamma)^4\epsilon^2}\right)$ samples. Furthermore, this algorithm admits a direct extension to linearly parameterized function approximators for large-scale applications, with computation and sample complexities independent of $|\mathcal{S}|$, $|\mathcal{A}|$, though at the cost of potential approximation bias.
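To get a feel for how the sample bound in this abstract scales, the following minimal Python sketch evaluates only its leading term, with the number of states and actions as plain integers. The function name and arguments are hypothetical, and the constants and logarithmic factors hidden by the tilde-O notation are ignored, so the result is an order-of-magnitude illustration rather than an actual sample count.

```python
import math


def sample_bound_leading_term(num_states, num_actions, gamma, epsilon, delta):
    """Leading term of the bound: S * A * log(1/delta) / ((1 - gamma)^4 * epsilon^2).

    Constants and log factors hidden by the tilde-O notation are NOT included.
    """
    numerator = num_states * num_actions * math.log(1.0 / delta)
    denominator = (1.0 - gamma) ** 4 * epsilon ** 2
    return numerator / denominator


# Halving epsilon multiplies the leading term by 4;
# halving (1 - gamma) multiplies it by 16.
base = sample_bound_leading_term(10, 5, gamma=0.9, epsilon=0.1, delta=0.05)
finer = sample_bound_leading_term(10, 5, gamma=0.9, epsilon=0.05, delta=0.05)
longer = sample_bound_leading_term(10, 5, gamma=0.95, epsilon=0.1, delta=0.05)
```

The quartic dependence on the effective horizon 1/(1 - gamma) dominates quickly, which is why discount factors close to 1 are the expensive regime under this bound.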
 [27] arXiv:1911.05887 (cross-list from cs.LG) [pdf, other]

Title: Revenue Maximization of Airbnb Marketplace using Search Results
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Correctly pricing products or services in an online marketplace is a challenging problem and one of the critical factors for the success of the business. When users are looking to buy an item they typically search for it. Query relevance models are used at this stage to retrieve and rank the items on the search page from most relevant to least relevant. The presented items are naturally "competing" against each other for user purchases. We provide a practical two-stage model to price this set of retrieved items for which distributions of their values are learned. The initial output of the pricing strategy is a price vector for the top displayed items in one search event. We later aggregate these results over searches to provide the supplier with the optimal price for each item. We applied our solution to large-scale search data obtained from the Airbnb Experiences marketplace. Offline evaluation results show that our strategy improves upon baseline pricing strategies on key metrics by at least +20% in terms of booking regret and +55% in terms of revenue potential.
 [28] arXiv:1911.05894 (cross-list from cs.SD) [pdf, other]

Title: Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
Authors: Aren Jansen, Daniel P. W. Ellis, Shawn Hershey, R. Channing Moore, Manoj Plakal, Ashok C. Popat, Rif A. Saurous
Comments: This extended version of an ICASSP 2020 submission under same title has an added figure and additional discussion for easier consumption
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate categories into relevant semantic classes. By training a combined sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the number of labels required to reach a desired classification performance.
 [29] arXiv:1911.05904 (cross-list from cs.LG) [pdf, other]

Title: There is Limited Correlation between Coverage and Robustness for Deep Neural Networks
Authors: Yizhen Dong, Peixin Zhang, Jingyi Wang, Shuang Liu, Jun Sun, Jianye Hao, Xinyu Wang, Li Wang, Jin Song Dong, Dai Ting
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE); Machine Learning (stat.ML)
Deep neural networks (DNN) are increasingly applied in safety-critical systems, e.g., for face recognition, autonomous car control and malware detection. It has also been shown that DNNs are subject to attacks such as adversarial perturbation and thus must be properly tested. Many coverage criteria for DNN have since been proposed, inspired by the success of code coverage criteria for software programs. The expectation is that if a DNN is well tested (and retrained) according to such coverage criteria, it is more likely to be robust. In this work, we conduct an empirical study to evaluate the relationship between coverage, robustness and attack/defense metrics for DNN. Our study is the largest to date and systematically done based on 100 DNN models and 25 metrics. One of our findings is that there is limited correlation between coverage and robustness, i.e., improving coverage does not help improve the robustness. Our dataset and implementation have been made available to serve as a benchmark for future studies on testing DNN.
 [30] arXiv:1911.05909 (cross-list from cs.LG) [pdf, other]

Title: Explainable Ordinal Factorization Model: Deciphering the Effects of Attributes by Piecewise Linear Approximation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Ordinal regression predicts the objects' labels that exhibit a natural ordering, which is important to many managerial problems such as credit scoring and clinical diagnosis. In these problems, the ability to explain how the attributes affect the prediction is critical to users. However, most, if not all, existing ordinal regression models simplify such explanation in the form of constant coefficients for the main and interaction effects of individual attributes. Such explanation cannot characterize the contributions of attributes at different value scales. To address this challenge, we propose a new explainable ordinal regression model, namely, the Explainable Ordinal Factorization Model (XOFM). XOFM uses piecewise linear functions to approximate the actual contributions of individual attributes and their interactions. Moreover, XOFM introduces a novel ordinal transformation process to assign each object the probabilities of belonging to multiple relevant classes, instead of fixing boundaries to differentiate classes. XOFM is based on Factorization Machines to handle the potential sparsity problem that results from discretizing the attribute scales. Comprehensive experiments with benchmark datasets and baseline models demonstrate that the proposed XOFM exhibits superior explainability and leads to state-of-the-art prediction accuracy.
 [31] arXiv:1911.05911 (cross-list from cs.DS) [pdf, ps, other]

Title: Recent Advances in Algorithmic High-Dimensional Robust Statistics
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Statistics Theory (math.ST); Machine Learning (stat.ML)
Learning in the presence of outliers is a fundamental problem in statistics. Until recently, all known efficient unsupervised learning algorithms were very sensitive to outliers in high dimensions. In particular, even for the task of robust mean estimation under natural distributional assumptions, no efficient algorithm was known. Recent work in theoretical computer science gave the first efficient robust estimators for a number of fundamental statistical tasks, including mean and covariance estimation. Since then, there has been a flurry of research activity on algorithmic high-dimensional robust estimation in a range of settings. In this survey article, we introduce the core ideas and algorithmic techniques in the emerging area of algorithmic high-dimensional robust statistics with a focus on robust mean estimation. We also provide an overview of the approaches that have led to computationally efficient robust estimators for a range of broader statistical tasks and discuss new directions and opportunities for future work.
 [32] arXiv:1911.05916 (cross-list from cs.LG) [pdf, other]

Title: Adversarial Margin Maximization Networks
Comments: 11 pages + 1 page appendix, accepted by TPAMI
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
The tremendous recent success of deep neural networks (DNNs) has sparked a surge of interest in understanding their predictive ability. Unlike the human visual system, which is able to generalize robustly and learn with little supervision, DNNs normally require a massive amount of data to learn new concepts. In addition, research works also show that DNNs are vulnerable to adversarial examples, maliciously generated images which seem perceptually similar to the natural ones but are actually formed to fool learning models, which means the models have trouble generalizing to unseen data with certain types of distortions. In this paper, we analyze the generalization ability of DNNs comprehensively and attempt to improve it from a geometric point of view. We propose adversarial margin maximization (AMM), a learning-based regularization which exploits an adversarial perturbation as a proxy. It encourages a large margin in the input space, just like support vector machines. With a differentiable formulation of the perturbation, we train the regularized DNNs simply through backpropagation in an end-to-end manner. Experimental results on various datasets (including MNIST, CIFAR-10/100, SVHN and ImageNet) and different DNN architectures demonstrate the superiority of our method over previous state-of-the-art methods. Code and models for reproducing our results will be made publicly available.
 [33] arXiv:1911.05922 (cross-list from cs.LG) [pdf, other]

Title: Atarifying the Vehicle Routing Problem with Stochastic Service Requests
Comments: 11 pages, 4 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We present a new general approach to modeling research problems as Atari-like videogames to make them amenable to recent groundbreaking solution methods from the deep reinforcement learning community. The approach is flexible, applicable to a wide range of problems. We demonstrate its application on a well-known vehicle routing problem. Our preliminary results on this problem, though not transformative, show signs of success and suggest that Atarification may be a useful modeling approach for researchers studying problems involving sequential decision making under uncertainty.
 [34] arXiv:1911.05941 (cross-list from cs.LG) [pdf, other]

Title: An Efficient Hardware-Oriented Dropout Algorithm
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper proposes a hardware-oriented dropout algorithm, which is efficient for field-programmable gate array (FPGA) implementation. In deep neural networks (DNNs), overfitting occurs when networks are overtrained and adapt too well to training data. Consequently, they fail in predicting unseen data used as test data. Dropout is a common technique that is often applied in DNNs to overcome this problem. In general, implementing such training algorithms of DNNs in embedded systems is difficult due to power and memory constraints. Training DNNs is power, time, and memory intensive; however, embedded systems require low power consumption and real-time processing. An FPGA is suitable for embedded systems for its parallel processing characteristic and low operating power; however, due to its limited memory and different architecture, it is difficult to apply general neural network algorithms. Therefore, we propose a hardware-oriented dropout algorithm that can effectively utilize the characteristics of an FPGA with less memory required. Software program verification demonstrates that the performance of the proposed method is identical to that of conventional dropout, and hardware synthesis demonstrates that it results in significant resource reduction.
 [35] arXiv:1911.05942 (cross-list from cs.CV) [pdf, other]

Title: Progressive Feature Polishing Network for Salient Object Detection
Comments: Accepted by AAAI 2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Feature matters for salient object detection. Existing methods mainly focus on designing a sophisticated structure to incorporate multi-level features and filter out cluttered features. We present Progressive Feature Polishing Network (PFPN), a simple yet effective framework to progressively polish the multi-level features to be more accurate and representative. By employing multiple Feature Polishing Modules (FPMs) in a recurrent manner, our approach is able to detect salient objects with fine details without any post-processing. An FPM updates the features of each level in parallel by directly incorporating all higher-level context information. Moreover, it preserves the dimensions and hierarchical structures of the feature maps, which makes it easy to integrate with any CNN-based model. Empirical experiments show that our results improve monotonically as the number of FPMs increases. Without bells and whistles, PFPN outperforms the state-of-the-art methods significantly on five benchmark datasets under various evaluation metrics.
 [36] arXiv:1911.05944 (cross-list from cs.LG) [pdf, other]

Title: 2L3W: 2-Level 3-Way Hardware-Software Co-Verification for the Mapping of Deep Learning Architecture (DLA) onto FPGA Boards
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
FPGAs have become a popular choice for deploying deep learning architectures (DLA). Many researchers have explored the deployment and mapping of DLA on FPGA. However, there has been a growing need to do design-time hardware-software co-verification of these deployments. To the best of our knowledge this is the first work that proposes a 2-Level 3-Way (2L3W) hardware-software co-verification methodology and provides a step-by-step guide for the successful mapping, deployment and verification of DLA on FPGA boards. The 2-Level verification makes sure the implementation in each stage (software and hardware) follows the desired behavior. The 3-Way co-verification provides a cross-paradigm (software, design and hardware) layer-by-layer parameter check to assure the correct implementation and mapping of the DLA onto FPGA boards. The proposed 2L3W co-verification methodology has been evaluated over several test cases. In each case, three sets of results are compared to obtain a layer-by-layer similarity score: the prediction and layer-by-layer output of the DLA deployed on a PYNQ FPGA board (hardware), the intermediate design results of the layer-by-layer output of the DLA implemented on Vivado HLS, and the prediction and layer-by-layer output at the software level (Caffe deep learning framework). The comparison is performed by a completely automated Python script. The resulting layer-by-layer similarity score informs us of the degree of success of the DLA mapping to the FPGA or helps identify at design time the layer to be debugged in the case of unsuccessful mapping. We demonstrated our technique on LeNet DLA and Caffe-inspired Cifar-10 DLA, and the co-verification results yielded layer-by-layer similarity scores of 99% accuracy.
 [37] arXiv:1911.05949 (cross-list from cs.LG) [pdf, ps, other]

Title: Online Second Price Auction with Semi-bandit Feedback Under the Non-Stationary Setting
Comments: Accepted to AAAI-20
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
In this paper, we study the non-stationary online second price auction problem. We assume that the seller is selling the same type of items in $T$ rounds by the second price auction, and she can set the reserve price in each round. In each round, the bidders draw their private values from a joint distribution unknown to the seller. Then, the seller announces the reserve price for this round. Next, bidders with private values higher than the announced reserve price in that round report their values to the seller as their bids. The bidder with the highest bid above the reserve price wins the item and pays the seller a price equal to the second-highest bid or the reserve price, whichever is larger. The seller wants to maximize her total revenue during the time horizon $T$ while learning the distribution of private values over time. The problem is more challenging than the standard online learning scenario since the private value distribution is non-stationary, meaning that the distribution of bidders' private values may change over time, and we need to use the non-stationary regret to measure the performance of our algorithm. To our knowledge, this paper is the first to study the repeated auction in the non-stationary setting theoretically. Our algorithm achieves the non-stationary regret upper bound $\tilde{\mathcal{O}}(\min\{\sqrt{\mathcal S T}, \bar{\mathcal{V}}^{\frac{1}{3}}T^{\frac{2}{3}}\})$, where $\mathcal S$ is the number of switches in the distribution and $\bar{\mathcal{V}}$ is the sum of total variation; neither $\mathcal S$ nor $\bar{\mathcal{V}}$ needs to be known by the algorithm. We also prove regret lower bounds $\Omega(\sqrt{\mathcal S T})$ in the switching case and $\Omega(\bar{\mathcal{V}}^{\frac{1}{3}}T^{\frac{2}{3}})$ in the dynamic case, showing that our algorithm has nearly optimal non-stationary regret.
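The single-round mechanism described in this abstract can be sketched in a few lines of Python. This is only an illustration of the auction rule, not the paper's learning algorithm; the function name, the tie-breaking rule, and the zero-revenue convention when nobody bids are assumptions made here.

```python
from typing import Optional, Sequence, Tuple


def second_price_round(values: Sequence[float],
                       reserve: float) -> Tuple[Optional[int], float]:
    """Run one auction round; return (winner index or None, seller revenue)."""
    # Only bidders whose private value exceeds the reserve price submit a bid,
    # and each such bidder reports her value as the bid.
    bids = [(v, i) for i, v in enumerate(values) if v > reserve]
    if not bids:
        # Assumption: the item goes unsold and the seller earns nothing.
        return None, 0.0
    # Highest bid first; ties are broken by the larger index (an assumption).
    bids.sort(reverse=True)
    winner = bids[0][1]
    # The winner pays the larger of the second-highest bid and the reserve.
    second_highest = bids[1][0] if len(bids) > 1 else reserve
    return winner, max(second_highest, reserve)


# Two bidders clear a reserve of 0.4, so the price is the second-highest bid.
winner_a, revenue_a = second_price_round([0.9, 0.5, 0.3], reserve=0.4)
# Only one bidder clears a reserve of 0.6, so the price is the reserve itself.
winner_b, revenue_b = second_price_round([0.9, 0.5, 0.3], reserve=0.6)
```

Because bidders below the reserve never report a value, the seller only observes the part of the value distribution above the reserve she chose, which is the semi-bandit feedback named in the title.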
 [38] arXiv:1911.05952 (cross-list from q-fin.ST) [pdf, other]

Title: Changepoint Analysis in Financial Networks
Subjects: Statistical Finance (q-fin.ST); Applications (stat.AP)
A major impact of globalization has been the information flow across the financial markets, rendering them vulnerable to financial contagion. Research has focused on network analysis techniques to understand the extent and nature of such information flow. It is now an established fact that a stock market crash in one country can have a serious impact on other markets across the globe. It follows that such crashes or critical regimes will affect the network dynamics of the global financial markets. In this paper, we use sequential change point detection in dynamic networks to detect changes in the network characteristics of thirteen stock markets across the globe. Our method helps us to detect changes in network behavior across all known stock market crashes during the period of study. In most of the cases, we can detect a change in the network characteristics prior to the crash. Our work thus opens the possibility of using this technique to create a warning bell for critical regimes in financial markets.
 [39] arXiv:1911.05954 (cross-list from cs.LG) [pdf, other]

Title: Hierarchical Graph Pooling with Structure Learning
Comments: Accepted to AAAI-2020; Code is available at this https URL
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Graph Neural Networks (GNNs), which generalize deep neural networks to graph-structured data, have drawn considerable attention and achieved state-of-the-art performance in numerous graph-related tasks. However, existing GNN models mainly focus on designing graph convolution operations. The graph pooling (or downsampling) operations, which play an important role in learning hierarchical representations, are usually overlooked. In this paper, we propose a novel graph pooling operator, called Hierarchical Graph Pooling with Structure Learning (HGP-SL), which can be integrated into various graph neural network architectures. HGP-SL incorporates graph pooling and structure learning into a unified module to generate hierarchical representations of graphs. More specifically, the graph pooling operation adaptively selects a subset of nodes to form an induced subgraph for the subsequent layers. To preserve the integrity of the graph's topological information, we further introduce a structure learning mechanism to learn a refined graph structure for the pooled graph at each layer. By combining the HGP-SL operator with graph neural networks, we perform graph-level representation learning with a focus on the graph classification task. Experimental results on six widely used benchmarks demonstrate the effectiveness of our proposed model.
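The pooling step mentioned in this abstract (select a subset of nodes, then continue on the induced subgraph) can be illustrated with a generic top-k sketch in Python. The abstract does not say how nodes are scored, so the `scores` argument below is a hypothetical stand-in, and the structure learning stage is not shown at all.

```python
import numpy as np


def topk_pool(adj, features, scores, k):
    """Keep the k highest-scoring nodes and return the induced subgraph.

    `scores` is a hypothetical per-node importance value supplied by the
    caller; it is not the scoring rule of the paper.
    """
    # Indices of the k largest scores, sorted so node order is preserved.
    keep = np.sort(np.argsort(scores)[-k:])
    # The induced subgraph keeps exactly the edges between retained nodes.
    pooled_adj = adj[np.ix_(keep, keep)]
    return pooled_adj, features[keep], keep


# Path graph 0 - 1 - 2 - 3 with one feature per node.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
features = np.arange(4, dtype=float).reshape(4, 1)
pooled_adj, pooled_features, kept = topk_pool(
    adj, features, scores=np.array([0.1, 0.9, 0.8, 0.2]), k=2)
```

Taking only the induced subgraph can disconnect nodes that were linked through a dropped node, which is the kind of lost topological information the abstract's structure learning mechanism is meant to address.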
 [40] arXiv:1911.05956 (cross-list from cs.LG) [pdf, other]

Title: Contextual Bandits Evolving Over Finite Time
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Contextual bandits have the same exploration-exploitation trade-off as standard multi-armed bandits. On adding positive externalities that decay with time, this problem becomes much more difficult as wrong decisions at the start are hard to recover from. We explore existing policies in this setting and highlight their biases towards the inherent reward matrix. We propose a rejection-based policy that achieves a low regret irrespective of the structure of the reward probability matrix.
 [41] arXiv:1911.05990 (cross-list from cs.LG) [pdf, other]

Title: Attention on Abstract Visual Reasoning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Attention mechanisms have been boosting the performance of deep learning models on a wide range of applications, ranging from speech understanding to program induction. However, despite experiments from psychology which suggest that attention plays an essential role in visual reasoning, the full potential of attention mechanisms has so far not been explored to solve abstract cognitive tasks on image data. In this work, we propose a hybrid network architecture, grounded in self-attention and relational reasoning. We call this new model Attention Relation Network (ARNe). ARNe combines features from the recently introduced Transformer and the Wild Relation Network (WReN). We test ARNe on the Procedurally Generated Matrices (PGMs) datasets for abstract visual reasoning. ARNe outperforms the WReN model on this task by 11.28 percentage points. Relational concepts between objects are learned efficiently, requiring only 35% of the training samples to surpass the reported accuracy of the baseline model. Our proposed hybrid model represents an alternative for learning abstract relations using self-attention and demonstrates that the Transformer network is also well suited for abstract visual reasoning.
 [42] arXiv:1911.05996 (cross-list from cs.LG) [pdf, other]

Title: Privacy and Utility Preserving Sensor-Data Transformations
Comments: Accepted to appear in Pervasive and Mobile computing (PMC) Journal, Elsevier
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP); Machine Learning (stat.ML)
Sensitive inferences and user re-identification are major threats to privacy when raw sensor data from wearable or portable devices are shared with cloud-assisted applications. To mitigate these threats, we propose mechanisms to transform sensor data before sharing them with applications running on users' devices. These transformations aim at eliminating patterns that can be used for user re-identification or for inferring potentially sensitive activities, while introducing a minor utility loss for the target application (or task). We show that, on gesture and activity recognition tasks, we can prevent inference of potentially sensitive activities while keeping the reduction in recognition accuracy of non-sensitive activities to less than 5 percentage points. We also show that we can reduce the accuracy of user re-identification and of the potential inference of gender to the level of a random guess, while keeping the accuracy of activity recognition comparable to that obtained on the original data.
 [43] arXiv:1911.05999 (cross-list from cs.LG) [pdf, other]

Title: An Application of Multiple-Instance Learning to Estimate Generalization Risk
Authors: Daiki Suehiro
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We focus on several learning approaches that employ the max-operator to evaluate the margin. For example, such approaches are commonly used in multi-class learning and top-rank learning tasks. In general, in order to estimate the theoretical generalization risk, we need to individually evaluate the complexity of each hypothesis class used in the learning approaches. In this paper, we provide a technique to estimate a theoretical generalization risk for such learning approaches in the same fashion. The key idea is to "redundantly" reformulate the learning problem as one-class multiple-instance learning by redefining the specific input space based on the original input space. Surprisingly, we succeed in improving the generalization risk bounds for some multi-class learning and top-rank learning algorithms.
 [44] arXiv:1911.06009 (cross-list from cs.LG) [pdf, other]

Title: A Recurrent Probabilistic Neural Network with Dimensionality Reduction Based on Time-series Discriminant Component Analysis
Comments: Published in IEEE Transactions on Neural Networks and Learning Systems
Journal-ref: IEEE Transactions on Neural Networks and Learning Systems, Vol. 26, No. 12, pp. 3021-3033, 2015
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper proposes a probabilistic neural network developed on the basis of time-series discriminant component analysis (TSDCA) that can be used to classify high-dimensional time-series patterns. TSDCA involves the compression of high-dimensional time series into a lower-dimensional space using a set of orthogonal transformations and the calculation of posterior probabilities based on a continuous-density hidden Markov model with a Gaussian mixture model expressed in the reduced-dimensional space. The analysis can be incorporated into a neural network, which is named a time-series discriminant component network (TSDCN), so that parameters of dimensionality reduction and classification can be obtained simultaneously as network coefficients according to a backpropagation-through-time-based learning algorithm with the Lagrange multiplier method. The TSDCN is considered to enable high-accuracy classification of high-dimensional time-series patterns and to reduce the computation time taken for network training. The validity of the TSDCN is demonstrated for high-dimensional artificial data and EEG signals in the experiments conducted during the study.
 [45] arXiv:1911.06015 (cross-list from cs.LG) [pdf, other]

Title: Robust Parameter-Free Season Length Detection in Time Series
Comments: MileTS 2017
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The in-depth analysis of time series has gained a lot of research interest in recent years, with the identification of periodic patterns being one important aspect. Many of the methods for identifying periodic patterns require the time series' season length as an input parameter. There exist only a few algorithms for automatic season length approximation. Many of these rely on simplifications such as data discretization and user-defined parameters. This paper presents an algorithm for season length detection that is designed to be sufficiently reliable to be used in practical applications and does not require any input other than the time series to be analyzed. The algorithm estimates a time series' season length by interpolating, filtering and detrending the data. This is followed by analyzing the distances between zeros in the directly corresponding autocorrelation function. Our algorithm was tested against a comparable algorithm and outperformed it by passing 122 out of 165 tests, while the existing algorithm passed 83 tests. The robustness of our method can be attributed both to the algorithmic approach and to design decisions taken at the implementation level.
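The idea of reading a season length off the zeros of the autocorrelation function can be sketched as follows in Python with NumPy. This is a simplified illustration under stated assumptions, not the paper's algorithm: a least-squares line stands in for the interpolation, filtering and detrending steps, and the estimate is taken as twice the median distance between zeros, on the assumption that a roughly sinusoidal autocorrelation crosses zero twice per season. The function name is hypothetical.

```python
import numpy as np


def estimate_season_length(series):
    """Estimate the season length from zero spacing in the autocorrelation."""
    x = np.asarray(series, dtype=float)
    t = np.arange(len(x))
    # Detrend by subtracting a least-squares line (a simple stand-in for the
    # preprocessing steps named in the abstract).
    slope, intercept = np.polyfit(t, x, 1)
    x = x - (slope * t + intercept)
    # Autocorrelation for lags 0 .. n/2; larger lags rest on too few samples.
    acf = np.correlate(x, x, mode="full")[len(x) - 1:][: len(x) // 2]
    # Find sign changes, then refine each zero by linear interpolation.
    positive = acf >= 0
    idx = np.where(positive[:-1] != positive[1:])[0]
    if len(idx) < 2:
        return None  # not enough zeros to measure a distance
    zeros = idx + acf[idx] / (acf[idx] - acf[idx + 1])
    # Assumption: consecutive zeros are half a season apart.
    return 2.0 * float(np.median(np.diff(zeros)))


# Synthetic check: a sine with period 25 on top of a linear trend.
t = np.arange(400)
estimate = estimate_season_length(np.sin(2 * np.pi * t / 25) + 0.01 * t)
```

Using the median rather than the mean of the zero distances makes the estimate less sensitive to a few spurious crossings caused by noise, although real data would likely also need the smoothing step that this sketch leaves out.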
 [46] arXiv:1911.06028 (cross-list from cs.LG) [pdf, other]

Title: SDGM: Sparse Bayesian Classifier Based on a Discriminative Gaussian Mixture Model
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In probabilistic classification, a discriminative model based on a Gaussian mixture exhibits flexible fitting capability. Nevertheless, it is difficult to determine the number of components. We propose a sparse classifier based on a discriminative Gaussian mixture model (GMM), which is named sparse discriminative Gaussian mixture (SDGM). In the SDGM, a GMM-based discriminative model is trained by sparse Bayesian learning. This learning algorithm improves the generalization capability by obtaining a sparse solution and automatically determines the number of components by removing redundant components. The SDGM can be embedded into neural networks (NNs) such as convolutional NNs and can be trained in an end-to-end manner. Experimental results indicated that the proposed method prevented overfitting by obtaining sparsity. Furthermore, we demonstrated that the proposed method outperformed a fully connected layer with the softmax function in certain cases when it was used as the last layer of a deep NN.
 [47] arXiv:1911.06048 (cross-list from cs.LG) [pdf, other]

Title: Conjugate Gradients for Kernel Machines
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Regularized least-squares (kernel-ridge / Gaussian process) regression is a fundamental algorithm of statistics and machine learning. Because generic algorithms for the exact solution have cubic complexity in the number of datapoints, large datasets require resorting to approximations. In this work, the computation of the least-squares prediction is itself treated as a probabilistic inference problem. We propose a structured Gaussian regression model on the kernel function that uses projections of the kernel matrix to obtain a low-rank approximation of the kernel and the matrix. A central result is an enhanced way to use the method of conjugate gradients for the specific setting of least-squares regression as encountered in machine learning. Our method improves the approximation of the kernel ridge regressor / Gaussian process posterior mean over vanilla conjugate gradients and allows computation of the posterior variance and the log marginal likelihood (evidence) without further overhead.
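For context, the vanilla conjugate gradients baseline that this abstract compares against can be sketched in Python with NumPy as below. This is the textbook method applied to the kernel ridge system (K + lambda I) alpha = y, not the enhanced probabilistic variant the paper proposes; the RBF kernel, the function names, and the parameter values are assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)


def conjugate_gradients(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive definite matrix A."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iter or 10 * len(b)):
        if np.sqrt(rs) < tol:
            break
        Ap = A @ p
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        # The new direction is conjugate to the previous ones with respect to A.
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


# Kernel ridge regression weights on a small synthetic problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = rng.normal(size=20)
A = rbf_kernel(X, X) + 0.1 * np.eye(20)   # K + lambda * I with lambda = 0.1
alpha = conjugate_gradients(A, y)
```

Each iteration costs one matrix-vector product, so stopping after a small number of iterations gives an approximate solution at quadratic rather than cubic cost in the number of datapoints.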
 [48] arXiv:1911.06057 (cross-list from cs.LG) [pdf, other]

Title: Supplementary material for Uncorrected least-squares temporal difference with lambda-return
Authors: Takayuki Osogami
Comments: 9 pages, supplementary material for an AAAI-20 paper
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Here, we provide supplementary material for Takayuki Osogami, "Uncorrected least-squares temporal difference with lambda-return," which appears in Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20).
 [49] arXiv:1911.06106 (cross-list from q-bio.BM) [pdf]

Title: AMP0: Species-Specific Prediction of Antimicrobial Peptides using Zero and Few Shot Learning
Comments: Under journal submission, 2019
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Machine Learning (stat.ML)
The evolution of drug-resistant microbial species is one of the major challenges to global health. The development of new antimicrobial treatments such as antimicrobial peptides needs to be accelerated to combat this threat. However, the discovery of novel antimicrobial peptides is hampered by low-throughput biochemical assays. Computational techniques can be used for rapid screening of promising antimicrobial peptide candidates prior to testing in the wet lab. The vast majority of existing antimicrobial peptide predictors are non-targeted in nature, i.e., they can predict whether a given peptide sequence is antimicrobial, but they are unable to predict whether the sequence can target a particular microbial species. In this work, we have developed a targeted antimicrobial peptide activity predictor that can predict whether a peptide is effective against a given microbial species or not. This has been made possible through zero-shot and few-shot machine learning. The proposed predictor, called AMP0, takes in the peptide amino acid sequence and any N/C-termini modifications together with the genomic sequence of a target microbial species to generate targeted predictions. It is important to note that the proposed method can generate predictions for species that are not part of its training set. The accuracy of predictions for novel test species can be further improved by providing a few example peptides for that species. Our computational cross-validation results show that the proposed scheme is particularly effective for targeted antimicrobial prediction in comparison to existing approaches and can be used for screening potential antimicrobial peptides in a targeted manner, especially for cases in which the number of training examples is small. The webserver of the method is available at this http URL
 [50] arXiv:1911.06107 (cross-list from q-bio.BM) [pdf, other]

Title: Earthmover-based manifold learning for analyzing molecular conformation spaces
Comments: 5 pages, 4 figures, 1 table
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
In this paper, we propose a novel approach for manifold learning that combines the Earthmover's distance (EMD) with the diffusion maps method for dimensionality reduction. We demonstrate the potential benefits of this approach for learning shape spaces of proteins and other flexible macromolecules using a simulated dataset of 3D density maps that mimic the non-uniform rotary motion of ATP synthase. Our results show that EMD-based diffusion maps require far fewer samples to recover the intrinsic geometry than the standard diffusion maps algorithm that is based on the Euclidean distance. To reduce the computational burden of calculating the EMD for all volume pairs, we employ a wavelet-based approximation to the EMD which reduces the computation of the pairwise EMDs to a computation of pairwise weighted-$\ell_1$ distances between wavelet coefficient vectors.
 [51] arXiv:1911.06111 (cross-list from cs.CL) [pdf, other]

Title: Instance-based Transfer Learning for Multilingual Deep Retrieval
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Perhaps the simplest type of multilingual transfer learning is instance-based transfer learning, in which data from the target language and the auxiliary languages are pooled, and a single model is learned from the pooled data. It is not immediately obvious when instance-based transfer learning will improve performance in this multilingual setting: for instance, a plausible conjecture is that this kind of transfer learning would help only if the auxiliary languages were very similar to the target. Here we show that at large scale, this method is surprisingly effective, leading to positive transfer on all 35 target languages we tested. We analyze this improvement and argue that the most natural explanation, namely direct vocabulary overlap between languages, only partially explains the performance gains: in fact, we demonstrate that target-language improvement can occur after adding data from an auxiliary language with no vocabulary in common with the target. This surprising result is due to the effect of transitive vocabulary overlaps between pairs of auxiliary and target languages.
 [52] arXiv:1911.06118 (cross-list from cs.CL) [pdf, ps, other]

Title: Learning Multi-Sense Word Distributions using Approximate Kullback-Leibler Divergence
Comments: 7 pages, 4 tables
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Learning word representations has garnered greater attention in the recent past due to its diverse text applications. Word embeddings encapsulate the syntactic and semantic regularities of sentences. Modelling word embeddings as multi-sense Gaussian mixture distributions additionally captures the uncertainty and polysemy of words. We propose to learn the Gaussian mixture representation of words using a Kullback-Leibler (KL) divergence-based objective function. The KL divergence-based energy function provides a better distance metric which can effectively capture entailment and distribution similarity among the words. Because the KL divergence between Gaussian mixtures is intractable, we use an approximation of the KL divergence between Gaussian mixtures. We perform qualitative and quantitative experiments on benchmark word similarity and entailment datasets which demonstrate the effectiveness of the proposed approach.
 [53] arXiv:1911.06129 (cross-list from cs.LG) [pdf, ps, other]

Title: A Bayesian/Information Theoretic Model of Bias Learning
Authors: Jonathan Baxter
Journal-ref: COLT 96 Proceedings of the ninth annual conference on Computational learning theory (1996) Pages 77-88
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper the problem of learning appropriate bias for an environment of related tasks is examined from a Bayesian perspective. The environment of related tasks is shown to be naturally modelled by the concept of an objective prior distribution. Sampling from the objective prior corresponds to sampling different learning tasks from the environment. It is argued that for many common machine learning problems, although we don't know the true (objective) prior for the problem, we do have some idea of a set of possible priors to which the true prior belongs. It is shown that under these circumstances a learner can use Bayesian inference to learn the true prior by sampling from the objective prior. Bounds are given on the amount of information required to learn a task when it is simultaneously learnt with several other tasks. The bounds show that if the learner has little knowledge of the true prior, and the dimensionality of the true prior is small, then sampling multiple tasks is highly advantageous.
 [54] arXiv:1911.06154 (cross-list from cs.CL) [pdf, other]

Title: A Massive Collection of Cross-Lingual Web-Document Pairs
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. Small-scale efforts have been made to collect aligned document-level data on a limited set of language pairs such as English-German or on limited comparable collections such as Wikipedia. In this paper, we mine twelve snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of 54 million URL pairs from Common Crawl covering documents in 92 languages paired with English. We evaluate the quality of the dataset by measuring the quality of machine translations from models that have been trained on mined parallel sentence pairs from this aligned corpus and introduce a simple yet effective baseline for identifying these aligned documents. The objective of this dataset and paper is to foster new research in cross-lingual NLP across a variety of low-, mid-, and high-resource languages.
 [55] arXiv:1911.06156 (cross-list from cs.CL) [pdf, other]

Title: Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding
Authors: Dhanasekar Sundararaman, Vivek Subramanian, Guoyin Wang, Shijing Si, Dinghan Shen, Dong Wang, Lawrence Carin
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Attention-based models have shown significant improvement over traditional algorithms in several NLP tasks. The Transformer, for instance, is an illustrative example that generates abstract representations of tokens input to an encoder based on their relationships to all tokens in a sequence. Recent studies have shown that although such models are capable of learning syntactic features purely by seeing examples, explicitly feeding this information to deep learning models can significantly enhance their performance. Leveraging syntactic information like part of speech (POS) may be particularly beneficial in limited training data settings for complex models such as the Transformer. We show that the syntax-infused Transformer with multiple features achieves an improvement of 0.7 BLEU when trained on the full WMT 14 English to German translation dataset and a maximum improvement of 1.99 BLEU points when trained on a fraction of the dataset. In addition, we find that the incorporation of syntax into BERT fine-tuning outperforms the baseline on a number of downstream tasks from the GLUE benchmark.
 [56] arXiv:1911.06164 (cross-list from cs.LG) [pdf, ps, other]

Title: Learning Model Bias
Authors: Jonathan Baxter
Journal-ref: Advances in Neural Information Processing Systems 8, 1995, 169-175
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper the problem of learning appropriate domain-specific bias is addressed. It is shown that this can be achieved by learning many related tasks from the same domain, and a theorem is given bounding the number of tasks that must be learnt. A corollary of the theorem is that if the tasks are known to possess a common internal representation or preprocessing then the number of examples required per task for good generalisation when learning $n$ tasks simultaneously scales like $O(a + \frac{b}{n})$, where $O(a)$ is a bound on the minimum number of examples required to learn a single task, and $O(a + b)$ is a bound on the number of examples required to learn each task independently. An experiment providing strong qualitative support for the theoretical results is reported.
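As a reading aid, the scaling stated in this abstract can be written out at its two extremes. The symbol $m(n)$ for the per-task sample requirement is notation assumed here, not taken from the paper, and the limit is meant informally.

```latex
m(n) = O\!\left(a + \frac{b}{n}\right), \qquad
m(1) = O(a + b), \qquad
m(n) \to O(a) \ \text{as } n \to \infty, \qquad
n \, m(n) = O(na + b).
```

Under this reading, the cost $b$ of learning the shared representation is paid once and amortized over the $n$ tasks, while the per-task cost $a$ is paid for every task.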
 [57] arXiv:1911.06182 (cross-list from cs.CL) [pdf, other]

Title: MML: Maximal Multiverse Learning for Robust Fine-Tuning of Language Models
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Recent state-of-the-art language models utilize a two-phase training procedure consisting of (i) unsupervised pre-training on unlabeled text, and (ii) fine-tuning for a specific supervised task. More recently, many studies have focused on trying to improve these models by enhancing the pre-training phase, either via better choice of hyperparameters or by leveraging an improved formulation. However, the pre-training phase is computationally expensive and often done on private datasets. In this work, we present a method that leverages BERT's fine-tuning phase to its fullest, by applying an extensive number of parallel classifier heads, which are constrained to be orthogonal, while adaptively eliminating the weaker heads during training. Our method allows the model to converge to an optimal number of parallel classifiers, depending on the dataset at hand.
We conduct extensive inter- and intra-dataset evaluations, showing that our method improves the robustness of BERT, sometimes leading to a +9% gain in accuracy. These results highlight the importance of a proper fine-tuning procedure, especially for relatively smaller-sized datasets. Our code is attached as supplementary and our models will be made completely public.
 [58] arXiv:1911.06187 (cross-list from math.AP) [pdf]

Title: Concordance probability in a big data setting: application in non-life insurance
Subjects: Analysis of PDEs (math.AP); Machine Learning (stat.ML)
The concordance probability or C-index is a popular measure to capture the discriminatory ability of a regression model. In this article, the definition of this measure is adapted to the specific needs of the frequency and severity model, typically used during the technical pricing of a non-life insurance product. Due to the typically large sample size of the frequency data in particular, two different adaptations of the estimation procedure of the concordance probability are presented. Note that the latter procedures can be applied to all different versions of the concordance probability.
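To make the quantity concrete, here is a minimal sketch of the basic concordance probability: the fraction of comparable observation pairs whose predictions are ordered the same way as their outcomes. It is the textbook pairwise definition, not the adapted frequency/severity versions proposed in the article; the function name and the convention of counting tied predictions as one half are assumptions.

```python
def concordance_probability(y_true, y_pred):
    """Naive O(n^2) C-index: share of comparable pairs ranked consistently."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # pairs with tied outcomes are not comparable
            comparable += 1
            d_true = y_true[i] - y_true[j]
            d_pred = y_pred[i] - y_pred[j]
            if d_pred == 0:
                concordant += 0.5  # tied predictions count as half (assumed convention)
            elif (d_true > 0) == (d_pred > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

perfect = concordance_probability([1, 2, 3], [0.1, 0.2, 0.3])
one_swap = concordance_probability([1, 2, 3], [0.1, 0.3, 0.2])
```

The double loop visits every pair, so the cost grows quadratically with the sample size, which is the kind of burden that motivates faster estimation procedures on large frequency datasets.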
 [59] arXiv:1911.06190 (cross-list from eess.SP) [pdf, other]

Title: An Improved Tobit Kalman Filter with Adaptive Censoring Limits
Comments: 21 pages, 32 figures
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST)
This paper deals with the Tobit Kalman filtering (TKF) process when the measurements are correlated and censored. The case of interval censoring, i.e., the case of measurements which belong to some interval with given censoring limits, is considered. Two improvements of the standard TKF process are proposed, in order to estimate the hidden state vectors. Firstly, the exact covariance matrix of the censored measurements is calculated by taking into account the censoring limits. Secondly, the probability that a latent (normally distributed) measurement falls inside or outside the uncensored region is calculated by taking into account the Kalman residual. The designed algorithm is tested using both synthetic and real data sets. The real data set includes human skeleton joints' coordinates captured by the Microsoft Kinect II sensor. In order to cope with certain real-life situations that cause problems in human skeleton tracking, such as (self-)occlusions, closely interacting persons etc., adaptive censoring limits are used in the proposed TKF process. Experiments show that the proposed method outperforms other filtering processes in minimizing the overall Root Mean Square Error (RMSE) for synthetic and real data sets.
 [60] arXiv:1911.06191 (cross-list from cs.CL) [pdf, other]

Title: Microsoft Research Asia's Systems for WMT19
Authors: Yingce Xia, Xu Tan, Fei Tian, Fei Gao, Weicong Chen, Yang Fan, Linyuan Gong, Yichong Leng, Renqian Luo, Yiren Wang, Lijun Wu, Jinhua Zhu, Tao Qin, Tie-Yan Liu
Comments: Accepted to "Fourth Conference on Machine Translation (WMT19)"
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
We, Microsoft Research Asia, made submissions to 11 language directions in the WMT19 news translation tasks. We won first place in 8 of the 11 directions and second place in the other three. Our basic systems are built on Transformer, back translation and knowledge distillation. We integrate several of our recent techniques to enhance the baseline systems: multi-agent dual learning (MADL), masked sequence-to-sequence pre-training (MASS), neural architecture optimization (NAO), and soft contextual data augmentation (SCA).
 [61] arXiv:1911.06192 (cross-list from cs.CL) [pdf, other]

Title: Multi-domain Dialogue State Tracking as Dynamic Knowledge Graph Enhanced Question Answering
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Multi-domain dialogue state tracking (DST) is a critical component for conversational AI systems. The domain ontology (i.e., specification of domains, slots, and values) of a conversational AI system is generally incomplete, making the capability for DST models to generalize to new slots, values, and domains during inference imperative. In this paper, we propose to model multi-domain DST as a question answering problem, referred to as Dialogue State Tracking via Question Answering (DSTQA). Within DSTQA, each turn generates a question asking for the value of a (domain, slot) pair, thus making it naturally extensible to unseen domains, slots, and values. Additionally, we use a dynamically-evolving knowledge graph to explicitly learn relationships between (domain, slot) pairs. Our model has a 5.80% and 12.21% relative improvement over the current state-of-the-art model on MultiWOZ 2.0 and MultiWOZ 2.1 datasets, respectively. Additionally, our model consistently outperforms the state-of-the-art model in domain adaptation settings.
 [62] arXiv:1911.06194 (cross-list from cs.CL) [pdf, other]

Title: Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models
Comments: 12 pages, 9 figures
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
The impressive performance of neural networks on natural language processing tasks is attributed to their ability to model complicated word and phrase interactions. Existing flat, word-level explanations of predictions hardly unveil how neural networks handle compositional semantics to reach predictions. To tackle the challenge, we study hierarchical explanation of neural network predictions. We identify non-additivity and independent importance attributions within hierarchies as two desirable properties for highlighting word and phrase interactions. We show that prior efforts on hierarchical explanations, e.g. contextual decomposition, do not satisfy the desired properties mathematically. In this paper, we propose a formal way to quantify the importance of each word or phrase for hierarchical explanations. Following the formulation, we propose the Sampling and Contextual Decomposition (SCD) algorithm and the Sampling and Occlusion (SOC) algorithm. Human and metrics evaluations on both LSTM models and BERT Transformer models on multiple datasets show that our algorithms outperform prior hierarchical explanation algorithms. Our algorithms apply to hierarchical visualization of compositional semantics, extraction of classification rules and improving human trust of models.
 [63] arXiv:1911.06197 (cross-list from cs.CL) [pdf]

Title: Towards automatic extractive text summarization of A-133 Single Audit reports with machine learning
Comments: 8 pages, first version
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
The rapid growth of text data has motivated the development of machine-learning based automatic text summarization strategies that concisely capture the essential ideas in a larger text. This study aimed to devise an extractive summarization method for A-133 Single Audits, which assess if recipients of federal grants are compliant with program requirements for use of federal funding. Currently, these voluminous audits must be manually analyzed by officials for oversight, risk management, and prioritization purposes. Automated summarization has the potential to streamline these processes. Analysis focused on the "Findings" section of ~20,000 Single Audits spanning 2016-2018. Following text preprocessing and GloVe embedding, sentence-level k-means clustering was performed to partition sentences by topic and to establish the importance of each sentence. For each audit, key summary sentences were extracted by proximity to cluster centroids. Summaries were judged by non-expert human evaluation and compared to human-generated summaries using the ROUGE metric. Though the goal was to fully automate summarization of A-133 audits, human input was required at various stages due to large variability in audit writing style, content, and context. Examples of human inputs include the number of clusters, the choice to keep or discard certain clusters based on their content relevance, and the definition of a top sentence. Overall, this approach made progress towards automated extractive summaries of A-133 audits, with future work to focus on full automation and improving summary consistency. This work highlights the inherent difficulty and subjective nature of automated summarization in a real-world application.
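The extraction step described in this abstract (picking the sentences closest to each cluster centroid) might look roughly like the sketch below. It assumes sentence embeddings, cluster labels, and centroids have already been computed by some k-means implementation; the function name, the toy 2-D vectors standing in for GloVe-based embeddings, and the `top_n` parameter are all hypothetical.

```python
import numpy as np

def nearest_to_centroids(embeddings, centroids, labels, top_n=1):
    """For each cluster, return indices of the top_n member sentences closest to its centroid."""
    selected = {}
    for k in range(centroids.shape[0]):
        members = np.flatnonzero(labels == k)
        if members.size == 0:
            selected[k] = []
            continue
        # Euclidean distance from each member sentence to the cluster centroid.
        dists = np.linalg.norm(embeddings[members] - centroids[k], axis=1)
        order = np.argsort(dists)[:top_n]
        selected[k] = members[order].tolist()
    return selected

# Toy data: six "sentences" in two clusters (illustrative values only).
embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0],
                       [5.0, 5.0], [5.2, 5.0], [6.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([embeddings[labels == k].mean(axis=0) for k in range(2)])

summary = nearest_to_centroids(embeddings, centroids, labels, top_n=1)
```

The human choices the abstract mentions map onto the inputs here: the number of clusters fixes how many centroids exist, discarding a cluster amounts to dropping its key from the result, and `top_n` encodes what counts as a top sentence.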
 [64] arXiv:1911.06204 (cross-list from cond-mat.stat-mech) [pdf, other]

Title: Estimating differential entropy using recursive copula splitting
Subjects: Statistical Mechanics (cond-mat.stat-mech); Statistics Theory (math.ST)
A method for estimating the Shannon differential entropy of multidimensional random variables using independent samples is described. The method is based on decomposing the distribution into a product of the marginal distributions and the joint dependency, also known as the copula. The entropy of marginals is estimated using one-dimensional methods. The entropy of the copula, which always has a compact support, is estimated recursively by splitting the data along statistically dependent dimensions. Numerical examples demonstrate that the method is accurate for distributions with compact and non-compact supports, which is imperative when the support is not known or of mixed type (in different dimensions). At high dimensions (larger than 20), our method is not only more accurate, but also significantly more efficient than existing approaches.
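The decomposition this abstract relies on can be stated in one line. The notation below is assumed for illustration: $X = (X_1, \dots, X_d)$ has marginal distribution functions $F_k$, and $c$ is the copula density, i.e. the joint density of $U_k = F_k(X_k)$ on the unit cube.

```latex
H(X_1, \dots, X_d) = \sum_{k=1}^{d} H(X_k) + H(c), \qquad
H(c) = -\int_{[0,1]^d} c(u) \, \log c(u) \, \mathrm{d}u \;\le\; 0 .
```

The copula term vanishes exactly when the components are independent, and because $c$ lives on $[0,1]^d$ regardless of the support of $X$, its entropy can be estimated on a compact domain.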
 [65] arXiv:1911.06217 (cross-list from cs.LG) [pdf, other]

Title: On Network Embedding for Machine Learning on Road Networks: A Case Study on the Danish Road Network
Comments: © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Journal-ref: 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 3422-3431
Subjects: Machine Learning (cs.LG); Databases (cs.DB); Machine Learning (stat.ML)
Road networks are a type of spatial network, where edges may be associated with qualitative information such as road type and speed limit. Unfortunately, such information is often incomplete; for instance, OpenStreetMap only has speed limits for 13% of all Danish road segments. This is problematic for analysis tasks that rely on such information for machine learning. To enable machine learning in such circumstances, one may consider the application of network embedding methods to extract structural information from the network. However, these methods have so far mostly been used in the context of social networks, which differ significantly from road networks in terms of, e.g., node degree and level of homophily (which are key to the performance of many network embedding methods). We analyze the use of network embedding methods, specifically node2vec, for learning road segment embeddings in road networks. Due to the often limited availability of information on other relevant road characteristics, the analysis focuses on leveraging the spatial network structure. Our results suggest that network embedding methods can indeed be used for deriving relevant network features (that may, e.g., be used for predicting speed limits), but that the qualities of the embeddings differ from embeddings for social networks.
 [66] arXiv:1911.06242 (cross-list from eess.SP) [pdf, other]

Title: Condition monitoring and early diagnostics methodologies for hydropower plants
Authors: Alessandro Betti (1), Emanuele Crisostomi (2), Gianluca Paolinelli (3), Antonio Piazzi (1), Fabrizio Ruffini (1), Mauro Tucci (2) ((1) iEM S.r.l., (2) Department of Energy, Systems, Territory and Constructions Engineering, University of Pisa and (3) Pure Power Control S.r.l.)
Comments: 8 pages, 4 figures. This work has been submitted to the Elsevier Renewable Energy for possible publication
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
Hydropower plants are one of the most convenient options for power generation, as they generate energy exploiting a renewable source, they have relatively low operating and maintenance costs, and they may be used to provide ancillary services, exploiting the large reservoirs of available water. The recent advances in Information and Communication Technologies (ICT) and in machine learning methodologies are seen as fundamental enablers to upgrade and modernize the current operation of most hydropower plants, in terms of condition monitoring, early diagnostics and eventually predictive maintenance. While very few works, or running technologies, have been documented so far for the hydro case, in this paper we propose a novel Key Performance Indicator (KPI) that we have recently developed and tested on operating hydropower plants. In particular, we show that after more than one year of operation it has been able to identify several faults, and to support the operation and maintenance tasks of plant operators. Also, we show that the proposed KPI outperforms conventional multivariable process control charts, like the Hotelling $t_2$ index.
 [67] arXiv:1911.06256 (cross-list from cs.LG) [pdf, other]

Title: A Comparative Study between Bayesian and Frequentist Neural Networks for Remaining Useful Life Estimation in Condition-Based Maintenance
Authors: Luca Della Libera
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In the last decade, deep learning (DL) has outperformed model-based and statistical approaches in predicting the remaining useful life (RUL) of machinery in the context of condition-based maintenance. One of the major drawbacks of DL is that it heavily depends on a large amount of labeled data, which are typically expensive and time-consuming to obtain, especially in industrial applications. Scarce training data lead to uncertain estimates of the model's parameters, which in turn result in poor prognostic performance. Quantifying this parameter uncertainty is important in order to determine how reliable the prediction is. Traditional DL techniques such as neural networks are incapable of capturing the uncertainty in the training data, thus they are overconfident about their estimates. On the contrary, Bayesian deep learning has recently emerged as a promising solution to account for uncertainty in the training process, achieving state-of-the-art performance in many classification and regression tasks. In this work Bayesian DL techniques such as Bayesian dense neural networks and Bayesian convolutional neural networks are applied to RUL estimation and compared to their frequentist counterparts from the literature. The effectiveness of the proposed models is verified on the popular C-MAPSS dataset. Furthermore, parameter uncertainty is quantified and used to gain additional insight into the data.
 [68] arXiv:1911.06257 (cross-list from cs.LG) [pdf, other]

Title: ViWi: A Deep Learning Dataset Framework for Vision-Aided Wireless Communications
Comments: The ViWi datasets and applications are available at this https URL
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP); Machine Learning (stat.ML)
The growing role that artificial intelligence and specifically machine learning is playing in shaping the future of wireless communications has opened up many new and intriguing research directions. This paper motivates the research in the novel direction of vision-aided wireless communications, which aims at leveraging visual sensory information in tackling wireless communication problems. Like any new research direction driven by machine learning, obtaining a development dataset poses the first and most important challenge to vision-aided wireless communications. This paper addresses this issue by introducing the Vision-Wireless (ViWi) dataset framework. It is developed to be a parametric, systematic, and scalable data generation framework. It utilizes advanced 3D-modeling and ray-tracing software to generate high-fidelity synthetic wireless and vision data samples for the same scenes. The result is a framework that not only offers a way to generate training and testing datasets but also helps provide a common ground on which the quality of different machine-learning-powered solutions could be assessed.
 [69] arXiv:1911.06267 (cross-list from quant-ph) [pdf, other]

Title: A regression algorithm for accelerated lattice QCD that exploits sparse inference on the D-Wave quantum annealer
Comments: 6 pages, 4 figures
Subjects: Quantum Physics (quant-ph); High Energy Physics - Lattice (hep-lat); Machine Learning (stat.ML)
We propose a regression algorithm that utilizes a learned dictionary optimized for sparse inference on a D-Wave quantum annealer. In this regression algorithm, we concatenate the independent and dependent variables as a combined vector, and encode the high-order correlations between them into a dictionary optimized for sparse reconstruction. On a test dataset, the dependent variable is initialized to its average value and then a sparse reconstruction of the combined vector is obtained in which the dependent variable is typically shifted closer to its true value, as in a standard inpainting or denoising task. Here, a quantum annealer, which can presumably exploit a fully entangled initial state to better explore the complex energy landscape, is used to solve the highly non-convex sparse coding optimization problem. The regression algorithm is demonstrated on lattice quantum chromodynamics simulation data using a D-Wave 2000Q quantum annealer and good prediction performance is achieved. The regression test is performed using six different values for the number of fully connected logical qubits, between 20 and 64, the latter being the maximum that can be embedded on the D-Wave 2000Q. The scaling results indicate that a larger number of qubits gives better prediction accuracy, the best performance being comparable to the best classical regression algorithms reported so far.
 [70] arXiv:1911.06285 (cross-list from cs.LG) [pdf, other]

Title: DomainGAN: Generating Adversarial Examples to Attack Domain Generation Algorithm Classifiers
Comments: 8 pages, 9 figures
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Signal Processing (eess.SP); Machine Learning (stat.ML)
Domain Generation Algorithms (DGAs) are frequently used to generate large numbers of domains for use by botnets. These domains are often used as rendezvous points for the servers that malware has command and control over. There are many algorithms that are used to generate domains, but many of these algorithms are simplistic and are very easy to detect using classical machine learning techniques. In this paper, three different variants of generative adversarial networks (GANs) are used to improve domain generation by making the domains more difficult for machine learning algorithms to detect. The domains generated by traditional DGAs and the GAN-based DGA are then compared by using state-of-the-art machine-learning-based DGA classifiers. The results show that the GAN-based DGAs are detected by the DGA classifiers significantly less often than the traditional DGAs. An analysis of the GAN variants is also performed to show which GAN variant produces the most usable domains. As verified by testing results and analysis, the Wasserstein GAN with Gradient Penalty (WGAN-GP) is the best GAN variant to use as a DGA.
 [71] arXiv:1911.06286 (cross-list from math.NA) [pdf, ps, other]

Title: Importance sampling for a robust and efficient multilevel Monte Carlo estimator for stochastic reaction networks
Subjects: Numerical Analysis (math.NA); Computation (stat.CO)
The multilevel Monte Carlo (MLMC) method for continuous-time Markov chains, first introduced by Anderson and Higham (2012), is a highly efficient simulation technique that can be used to estimate various statistical quantities for stochastic reaction networks (SRNs), and in particular for stochastic biological systems. Unfortunately, the robustness and performance of the multilevel method can deteriorate due to the phenomenon of high kurtosis, observed at the deep levels of MLMC, which leads to inaccurate estimates of the sample variance. In this work, we address cases where the high-kurtosis phenomenon is due to \textit{catastrophic coupling} (characteristic of pure jump processes where coupled consecutive paths are identical in most of the simulations, while differences only appear in a very small proportion), and introduce a pathwise dependent importance sampling technique that improves the robustness and efficiency of the multilevel method. Our analysis, along with the conducted numerical experiments, demonstrates that our proposed method significantly reduces the kurtosis of the deep levels of MLMC, and also improves the strong convergence rate from $\beta=1$ for the standard case (without importance sampling) to $\beta=1+\delta$, where $0<\delta<1$ is a user-selected parameter in our importance sampling algorithm. By the complexity theorem of MLMC, and given a pre-selected tolerance, $TOL$, this results in an improvement of the complexity from $\mathcal{O}\left(TOL^{-2} \log(TOL)^2\right)$ in the standard case to $\mathcal{O}\left(TOL^{-2}\right)$.
 [72] arXiv:1911.06316 (cross-list from eess.SP) [pdf, other]

Title: Real-time Anomaly Detection and Classification in Streaming PMU Data
Comments: 9 pages, 12 figures
Subjects: Signal Processing (eess.SP); Machine Learning (stat.ML)
Ensuring secure and reliable operations of the power grid is a primary concern of system operators. Phasor measurement units (PMUs) are rapidly being deployed in the grid to provide fast-sampled operational data that should enable quicker decision-making. This work presents a general interpretable framework for analyzing real-time PMU data, thus enabling grid operators to understand the current state and to identify anomalies on the fly. Applying statistical learning tools to the streaming data, we first learn an effective dynamical model to describe the current behavior of the system. Next, we use the probabilistic predictions of our learned model to define, in a principled way, an efficient anomaly detection tool. Finally, the last module of our framework produces on-the-fly classification of the detected anomalies into common occurrence classes using features that grid operators are familiar with. We demonstrate the efficacy of our interpretable approach through extensive numerical experiments on real PMU data collected from a transmission operator in the USA.
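The first two stages described above (learn a dynamical model from the stream, then score each new sample against the model's probabilistic prediction) can be sketched in a few lines of Python. This is a hypothetical stand-in, not the authors' framework: the AR(1) model, the sliding window, the z-score threshold, and the class name `StreamingAR1Detector` are all assumptions chosen for brevity, and the classification stage is omitted.

```python
import math
from collections import deque


class StreamingAR1Detector:
    """Illustrative detector: AR(1) model refit on a sliding window (assumed design)."""

    def __init__(self, window=50, threshold=4.0, min_history=10):
        self.history = deque(maxlen=window)  # most recent accepted samples
        self.threshold = threshold           # flag when |residual| / sigma exceeds this
        self.min_history = min_history       # samples needed before scoring starts

    def _fit(self):
        # Least-squares fit of x[t+1] = a * x[t] + b on the current window.
        xs = list(self.history)
        prev, nxt = xs[:-1], xs[1:]
        mean_prev = sum(prev) / len(prev)
        mean_next = sum(nxt) / len(nxt)
        var = sum((p - mean_prev) ** 2 for p in prev)
        cov = sum((p - mean_prev) * (q - mean_next) for p, q in zip(prev, nxt))
        a = cov / var if var > 0 else 0.0
        b = mean_next - a * mean_prev
        residuals = [q - (a * p + b) for p, q in zip(prev, nxt)]
        dof = max(len(residuals) - 2, 1)
        sigma = math.sqrt(sum(r * r for r in residuals) / dof)
        return a, b, sigma

    def update(self, value):
        """Score one new sample; return (is_anomaly, z_score)."""
        is_anomaly, score = False, 0.0
        if len(self.history) >= self.min_history:
            a, b, sigma = self._fit()
            predicted = a * self.history[-1] + b
            score = abs(value - predicted) / max(sigma, 1e-12)
            is_anomaly = score > self.threshold
        if not is_anomaly:
            # Keep flagged samples out of the window so they do not distort the model.
            self.history.append(value)
        return is_anomaly, score
```

A caller would feed samples one at a time, for example `flag, score = detector.update(sample)`, and pass flagged samples on to a separate classification step. Refitting on every sample keeps the sketch short; a real streaming system would more likely update the fit incrementally.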
 [73] arXiv:1911.06317 (cross-list from cs.LG) [pdf, other]

Title: Gradientless Descent: High-Dimensional Zeroth-Order Optimization
Authors: Daniel Golovin, John Karro, Greg Kochanski, Chansoo Lee, Xingyou Song, Qiuyi (Richard) Zhang
Comments: 11 main pages, 26 total pages
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Zeroth-order optimization is the process of minimizing an objective $f(x)$, given oracle access to evaluations at adaptively chosen inputs $x$. In this paper, we present two simple yet powerful GradientLess Descent (GLD) algorithms that do not rely on an underlying gradient estimate and are numerically stable. We analyze our algorithms from a novel geometric perspective and present a novel analysis that shows convergence within an $\epsilon$-ball of the optimum in $O(kQ\log(n)\log(R/\epsilon))$ evaluations, for {\it any monotone transform} of a smooth and strongly convex objective with latent dimension $k < n$, where the input dimension is $n$, $R$ is the diameter of the input space and $Q$ is the condition number. Our rates are the first of their kind to be both 1) poly-logarithmically dependent on dimensionality and 2) invariant under monotone transformations. We further leverage our geometric perspective to show that our analysis is optimal. Both monotone invariance and the ability to utilize a low latent dimensionality are key to the empirical success of our algorithms, as demonstrated on BBOB and MuJoCo benchmarks.
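As an illustration of the comparison-only search this abstract describes, here is a minimal Python sketch. It is not the authors' algorithm: the halving radius schedule, the uniform-in-ball sampling, the sequential acceptance rule, and the names `gld_search`, `max_radius`, and `min_radius` are assumptions made for this example. Because the update only compares objective values, a strictly increasing transform of the objective leaves the accept or reject decisions unchanged in exact arithmetic, which is the monotone invariance mentioned above.

```python
import math
import random


def gld_search(f, x0, max_radius, min_radius, iterations, seed=0):
    """Gradient-free descent sketch: try random steps at geometrically spaced radii.

    Assumed design, for illustration only. Returns (best_point, best_value).
    """
    rng = random.Random(seed)
    x = list(x0)
    fx = f(x)
    n = len(x)
    # Number of radii needed to go from max_radius down to about min_radius by halving.
    num_scales = math.ceil(math.log2(max_radius / min_radius)) + 1
    for _ in range(iterations):
        for k in range(num_scales):
            radius = max_radius * 2.0 ** (-k)
            # Uniform sample from the ball of this radius:
            # a normalized Gaussian direction times a radially scaled length.
            g = [rng.gauss(0.0, 1.0) for _ in range(n)]
            norm = math.sqrt(sum(v * v for v in g))
            length = radius * rng.random() ** (1.0 / n)
            candidate = [xi + length * gi / norm for xi, gi in zip(x, g)]
            fc = f(candidate)
            # Only a comparison of function values is used; no gradient estimate.
            if fc < fx:
                x, fx = candidate, fc
    return x, fx


def warped_sphere(x):
    # exp(.) is a monotone transform of a smooth, strongly convex quadratic.
    return math.exp(sum(v * v for v in x))


best_x, best_value = gld_search(warped_sphere, [3.0] * 5, 8.0, 1e-3, 200, seed=1)
```

Trying every radius in each outer iteration means the search does not need to know the right step size in advance, at the cost of a logarithmic number of extra evaluations per iteration.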
 [74] arXiv:1911.06319 (cross-list from cs.LG) [pdf, ps, other]

Title: The Canonical Distortion Measure for Vector Quantization and Function Approximation
Authors: Jonathan Baxter
Journal-ref: In: Thrun S., Pratt L. (eds) Learning to Learn (1998). Pages 159-177
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
To measure the quality of a set of vector quantization points, a means of measuring the distance between a random point and its quantization is required. Common metrics such as the {\em Hamming} and {\em Euclidean} metrics, while mathematically simple, are inappropriate for comparing natural signals such as speech or images. In this paper it is shown how an {\em environment} of functions on an input space $X$ induces a {\em canonical distortion measure} (CDM) on $X$. The depiction "canonical" is justified because it is shown that optimizing the reconstruction error of $X$ with respect to the CDM gives rise to optimal piecewise constant approximations of the functions in the environment. The CDM is calculated in closed form for several different function classes. An algorithm for training neural networks to implement the CDM is presented along with some encouraging experimental results.
Replacements for Fri, 15 Nov 19
 [75] arXiv:1108.2883 (replaced) [pdf, ps, other]

Title: Bayesian test of normality versus a Dirichlet process mixture alternative
Comments: 24 pages, 5 figures, 1 table
Subjects: Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
 [76] arXiv:1610.10028 (replaced) [pdf, other]

Title: Refiltering hypothesis tests to control sign error
Authors: Art B. Owen
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
 [77] arXiv:1612.08288 (replaced) [pdf, ps, other]

Title: Instrumental Variable Quantile Regression with Misclassification
Authors: Takuya Ura
Subjects: Methodology (stat.ME)
 [78] arXiv:1707.09049 (replaced) [pdf, other]

Title: Variational Joint Filtering
Subjects: Machine Learning (stat.ML)
 [79] arXiv:1801.03583 (replaced) [pdf, other]

Title: Graphical Models for Processing Missing Data
Comments: 34 pages, 5 figures
Subjects: Methodology (stat.ME)
 [80] arXiv:1801.08120 (replaced) [pdf, other]

Title: Optimal Estimation of Simultaneous Signals Using Absolute Inner Product with Applications to Integrative Genomics
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
 [81] arXiv:1805.06970 (replaced) [pdf, other]

Title: Global and Simultaneous Hypothesis Testing for High-Dimensional Logistic Regression Models
Subjects: Methodology (stat.ME)
 [82] arXiv:1807.05832 (replaced) [pdf, ps, other]

Title: Manifold Adversarial Learning
Comments: 11 pages, 26 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [83] arXiv:1809.02463 (replaced) [pdf, other]

Title: Dirichlet process mixtures under affine transformations of the data
Comments: 35 pages, 7 Figures
Subjects: Methodology (stat.ME)
 [84] arXiv:1811.08039 (replaced) [pdf, other]

Title: Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [85] arXiv:1901.09078 (replaced) [pdf, other]

Title: Finding Archetypal Spaces Using Neural Networks
Comments: 9 pages, 10 figures, to be presented at IEEE Big Data 2019
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [86] arXiv:1903.00904 (replaced) [pdf, other]

Title: adVAE: A self-adversarial variational autoencoder with Gaussian anomaly prior knowledge for anomaly detection
Comments: This paper has been accepted by Knowledge-based Systems
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [87] arXiv:1903.02380 (replaced) [pdf, other]

Title: Detecting Overfitting via Adversarial Examples
Comments: 17 pages
Journal-ref: Part of: Advances in Neural Information Processing Systems 32 (NIPS 2019) pre-proceedings
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [88] arXiv:1903.03894 (replaced) [pdf, other]

Title: GNNExplainer: Generating Explanations for Graph Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [89] arXiv:1903.08987 (replaced) [pdf, other]

Title: Some New Copula Based Distribution-free Tests of Independence among Several Random Variables
Comments: arXiv admin note: text overlap with arXiv:1708.07485
Subjects: Statistics Theory (math.ST)
 [90] arXiv:1904.07199 (replaced) [pdf, other]

Title: Exact Rate-Distortion in Autoencoders via Echo Noise
Comments: NeurIPS 2019; updated Gaussian baseline results, added disentanglement
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
 [91] arXiv:1904.08497 (replaced) [pdf, other]

Title: An In-Depth Study on Open-Set Camera Model Identification
Comments: Published through IEEE Access journal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [92] arXiv:1904.10921 (replaced) [pdf, other]

Title: Plug-in, Trainable Gate for Streamlining Arbitrary Neural Networks
Comments: Accepted to AAAI 2020 (Poster)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [93] arXiv:1905.00626 (replaced) [pdf, other]

Title: On Linear Learning with Manycore Processors
Comments: To appear in: 2019 IEEE 26th International Conference on High Performance Computing (HiPC)
Subjects: Performance (cs.PF); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [94] arXiv:1905.10259 (replaced) [pdf, other]

Title: Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks
Comments: 22 pages. Accepted for publication at NeurIPS 2019
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [95] arXiv:1905.11232 (replaced) [pdf, other]

Title: Efficient posterior sampling for high-dimensional imbalanced logistic regression
Comments: 4 figures
Subjects: Methodology (stat.ME); Computation (stat.CO)
 [96] arXiv:1905.11614 (replaced) [pdf, other]

Title: Uncertainty-based Continual Learning with Adaptive Regularization
Comments: 10 pages (including Supplementary Materials), NeurIPS 2019 camera ready version
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [97] arXiv:1906.00531 (replaced) [pdf, other]

Title: Model selection for contextual bandits
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
 [98] arXiv:1906.02685 (replaced) [pdf, other]

Title: Stochastic Bandits with Context Distributions
Comments: Accepted at NeurIPS 2019
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [99] arXiv:1906.04159 (replaced) [pdf, other]

Title: Inference and Uncertainty Quantification for Noisy Matrix Completion
Comments: published at Proceedings of the National Academy of Sciences Nov 2019, 116 (46) 22931-22937
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC); Statistics Theory (math.ST)
 [100] arXiv:1906.04328 (replaced) [pdf, other]

Title: Importance Resampling for Off-policy Prediction
Comments: Recently published in NeurIPS 2019
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [101] arXiv:1906.04834 (replaced) [pdf, other]

Title: Relaxed random walks at scale
Comments: 18 pages, 4 figures
Subjects: Populations and Evolution (q-bio.PE); Methodology (stat.ME)
 [102] arXiv:1906.06899 (replaced) [pdf, other]

Title: A Provably Correct and Robust Algorithm for Convolutive Nonnegative Matrix Factorization
Comments: 24 pages, 4 figures, references updated
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [103] arXiv:1908.01109 (replaced) [pdf, other]

Title: The Use of Binary Choice Forests to Model and Estimate Discrete Choices
Comments: 56 pages, 4 figures, 11 tables
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
 [104] arXiv:1908.03015 (replaced) [pdf, other]

Title: Augmenting Variational Autoencoders with Sparse Labels: A Unified Framework for Unsupervised, Semi-(un)supervised, and Supervised Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
 [105] arXiv:1908.07832 (replaced) [pdf, other]

Title: Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [106] arXiv:1909.05289 (replaced) [pdf, other]

Title: Deep Prediction of Investor Interest: a Supervised Clustering Approach
Subjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Machine Learning (stat.ML)
 [107] arXiv:1910.06539 (replaced) [pdf, other]

Title: Challenges in Bayesian inference via Markov chain Monte Carlo for neural networks
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
 [108] arXiv:1910.07295 (replaced) [pdf, other]

Title: Unsupervised Domain Adaptation Meets Offline Recommender Learning
Authors: Yuta Saito
Comments: accepted to the NewInML forum (co-located with NeurIPS 2019)
Subjects: Machine Learning (stat.ML); Information Retrieval (cs.IR); Machine Learning (cs.LG)
 [109] arXiv:1910.10308 (replaced) [pdf, other]

Title: Weighted Distributed Differential Privacy ERM: Convex and Nonconvex
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [110] arXiv:1910.13573 (replaced) [pdf, other]

Title: Semi-Supervised Natural Language Approach for Fine-Grained Classification of Medical Reports
Authors: Neil Deshmukh, Selin Gumustop, Romane Gauriau, Varun Buch, Bradley Wright, Christopher Bridge, Ram Naidu, Katherine Andriole, Bernardo Bizzo
Comments: Accepted for IEEE publication & presented at MIT URTC
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
 [111] arXiv:1911.00348 (replaced) [pdf, other]

Title: Hierarchical Expert Networks for Meta-Learning
Comments: arXiv admin note: text overlap with arXiv:1907.11452
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [112] arXiv:1911.00847 (replaced) [pdf, other]

Title: Weakly Supervised Deep Learning Approach in Streaming Environments
Comments: This paper has been accepted for publication in The 2019 IEEE International Conference on Big Data (IEEE BigData 2019), Los Angeles, CA, USA
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
 [113] arXiv:1911.01731 (replaced) [pdf, other]

Title: GraphAIR: Graph Representation Learning with Neighborhood Aggregation and Interaction
Comments: 8 pages, in submission to IEEE Transactions on Knowledge and Data Engineering
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [114] arXiv:1911.02915 (replaced) [pdf, other]

Title: A Statistically Identifiable Model for Tensor-Valued Gaussian Random Variables
Comments: 14 pages, 12 figures
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Statistics Theory (math.ST)
 [115] arXiv:1911.02966 (replaced) [pdf]

Title: An automated approach for task evaluation using EEG signals
Comments: 19 pages, 10 figures, 4 tables
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
 [116] arXiv:1911.04448 (replaced) [pdf, other]

Title: Real-Time Reinforcement Learning
Comments: NeurIPS 2019
Journal-ref: Neural Information Processing Systems (2019)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [117] arXiv:1911.05109 (replaced) [pdf, other]

Title: Harmonic Mean Point Processes: Proportional Rate Error Minimization for Obtundation Prediction
Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
 [118] arXiv:1911.05211 (replaced) [pdf, other]

Title: AMPL: A Data-Driven Modeling Pipeline for Drug Discovery
Authors: Amanda J. Minnich, Kevin McLoughlin, Margaret Tse, Jason Deng, Andrew Weber, Neha Murad, Benjamin D. Madej, Bharath Ramsundar, Tom Rush, Stacie Calad-Thomson, Jim Brase, Jonathan E. Allen
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [119] arXiv:1911.05309 (replaced) [pdf, other]

Title: Adaptive Portfolio by Solving Multi-armed Bandit via Thompson Sampling
Comments: conference
Subjects: Machine Learning (cs.LG); Portfolio Management (q-fin.PM); Machine Learning (stat.ML)
 [120] arXiv:1911.05485 (replaced) [pdf, ps, other]

Title: Diffusion Improves Graph Learning
Comments: Published as a conference paper at NeurIPS 2019
Journal-ref: Thirty-third Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [121] arXiv:1911.05684 (replaced) [pdf, other]

Title: A Simulation-free Group Sequential Design with Max-combo Tests in the Presence of Non-proportional Hazards
Authors: Lili Wang (1), Xiaodong Luo (2), Cheng Zheng (2) ((1) Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, U.S.A. (2) Department of Biostatistics and Programming, Research and Development, Sanofi US, Bridgewater, New Jersey, U.S.A.)
Subjects: Methodology (stat.ME)