New submissions


New submissions for Fri, 16 Feb 18

[1]  arXiv:1802.05292 [pdf, ps, other]
Title: Flexible and objective time series analysis: a loss-based approach with two-piece location-scale distributions
Comments: 26 pages, 6 Figures
Subjects: Methodology (stat.ME); Other Statistics (stat.OT)

Two-piece location-scale models are used for modeling data presenting departures from symmetry. In this paper, we propose an objective Bayesian methodology for the tail parameter of two particular distributions of the above family: the skewed exponential power distribution and the skewed generalised logistic distribution. We apply the proposed objective approach to time series models and linear regression models where the error terms follow the distributions under study. The performance of the proposed approach is illustrated through simulation experiments and real data analysis. The methodology yields improvements in density forecasts, as shown by our analysis of electricity prices in Nordpool markets.

[2]  arXiv:1802.05342 [pdf, other]
Title: Spatial Coherence of Oriented White Matter Microstructure: Applications to White Matter Regions Associated with Genetic Similarity
Journal-ref: NeuroImage (2018)
Subjects: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

We present a method to discover differences between populations with respect to the spatial coherence of their oriented white matter microstructure in arbitrarily shaped white matter regions. This method is applied to diffusion MRI scans of a subset of the Human Connectome Project dataset: 57 pairs of monozygotic and 52 pairs of dizygotic twins. After controlling for morphological similarity between twins, we identify 3.7% of all white matter as being associated with genetic similarity (35.1k voxels, $p < 10^{-4}$, false discovery rate 1.5%), 75% of which spatially clusters into twenty-two contiguous white matter regions. Furthermore, we show that the orientation similarity within these regions generalizes to a subset of 47 pairs of non-twin siblings, and show that these siblings are on average as similar as dizygotic twins. The regions are located in deep white matter including the superior longitudinal fasciculus, the optic radiations, the middle cerebellar peduncle, the corticospinal tract, and within the anterior temporal lobe, as well as the cerebellum, brain stem, and amygdalae.
These results extend previous work using undirected fractional anisotropy for measuring putative heritable influences in white matter. Our multidirectional extension better accounts for crossing fiber connections within voxels. This bottom-up approach has at its basis a novel measurement of coherence within neighboring voxel dyads between subjects, and avoids some of the fundamental ambiguities encountered with tractographic approaches to white matter analysis that estimate global connectivity.

[3]  arXiv:1802.05355 [pdf, other]
Title: The Role of Information Complexity and Randomization in Representation Learning
Comments: 35 pages, 3 figures. Submitted for publication
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

A grand challenge in representation learning is to learn the different explanatory factors of variation behind high-dimensional data. Encoder models are often determined to optimize performance on training data when the real objective is to generalize well to unseen data. Although there is enough numerical evidence suggesting that noise injection (during training) at the representation level might improve the generalization ability of encoders, an information-theoretic understanding of this principle remains elusive. This paper presents a sample-dependent bound on the generalization gap of the cross-entropy loss that scales with the information complexity (IC) of the representations, meaning the mutual information between inputs and their representations. The IC is empirically investigated for standard multi-layer neural networks with SGD on MNIST and CIFAR-10 datasets; the gap and the IC appear to be directly correlated, suggesting that SGD selects encoders to implicitly minimize the IC. We specialize the IC to study the role of Dropout on the generalization capacity of deep encoders, which is shown to be directly related to the encoder capacity, being a measure of the distinguishability among samples from their representations. Our results support some recent regularization methods.

[4]  arXiv:1802.05370 [pdf, other]
Title: Covariance Function Pre-Training with m-Kernels for Accelerated Bayesian Optimisation
Subjects: Machine Learning (stat.ML)

The paper presents a novel approach to direct covariance function learning for Bayesian optimisation, with particular emphasis on experimental design problems where an existing corpus of condensed knowledge is present. The method presented borrows techniques from reproducing kernel Banach space theory (specifically m-kernels) and leverages them to convert (or re-weight) existing covariance functions into new, problem-specific covariance functions. The key advantage of this approach is that rather than relying on the user to manually select (with some hyperparameter tuning and experimentation) an appropriate covariance function it constructs the covariance function to specifically match the problem at hand. The technique is demonstrated on two real-world problems - specifically alloy design and carbon-fibre manufacturing - as well as a selected test function.

[5]  arXiv:1802.05400 [pdf]
Title: High Dimensional Bayesian Optimization Using Dropout
Comments: 7 pages; Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence 2017
Subjects: Machine Learning (stat.ML)

Scaling Bayesian optimization to high dimensions is a challenging task, as the global optimization of a high-dimensional acquisition function can be expensive and often infeasible. Existing methods depend either on limited active variables or on the additive form of the objective function. We propose a new method for high-dimensional Bayesian optimization that uses a dropout strategy to optimize only a subset of variables at each iteration. We derive theoretical bounds for the regret and show how they can inform the derivation of our algorithm. We demonstrate the efficacy of our algorithms for optimization on two benchmark functions and two real-world applications: training cascade classifiers and optimizing alloy composition.

[6]  arXiv:1802.05431 [pdf, other]
Title: On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo
Comments: 37 pages; 4 figures
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

We provide convergence guarantees in Wasserstein distance for a variety of variance-reduction methods: SAGA Langevin diffusion, SVRG Langevin diffusion and control-variate underdamped Langevin diffusion. We analyze these methods under a uniform set of assumptions on the log-posterior distribution, assuming it to be smooth, strongly convex and Hessian Lipschitz. This is achieved by a new proof technique combining ideas from finite-sum optimization and the analysis of sampling methods. Our sharp theoretical bounds allow us to identify regimes of interest where each method performs better than the others. Our theory is verified with experiments on real-world and synthetic datasets.

[7]  arXiv:1802.05444 [pdf, other]
Title: A Weighted Likelihood Approach Based on Statistical Data Depths
Subjects: Methodology (stat.ME)

We propose a general approach to construct weighted likelihood estimating equations with the aim of obtaining robust estimates. The weight, attached to each score contribution, is evaluated by comparing the statistical data depth at the model with that of the sample at a given point. Observations are considered regular when the ratio of these two depths is close to one, whereas when the ratio is large the corresponding score contribution may be downweighted. Details and examples are provided for the robust estimation of the parameters in the multivariate normal model. Because of the form of the weights, we expect that there will be no downweighting under the true model, leading to highly efficient estimators. Robustness is illustrated using two real data sets.

[8]  arXiv:1802.05447 [pdf, other]
Title: History PCA: A New Algorithm for Streaming PCA
Subjects: Machine Learning (stat.ML)

In this paper we propose a new algorithm for streaming principal component analysis. With limited memory, small devices cannot store all the samples in the high-dimensional regime. Streaming principal component analysis aims to find the $k$-dimensional subspace which can explain the most variation of the $d$-dimensional data points that come into memory sequentially. In order to deal with large $d$ and large $N$ (number of samples), most streaming PCA algorithms update the current model using only the incoming sample and then dump the information right away to save memory. However the information contained in previously streamed data could be useful. Motivated by this idea, we develop a new streaming PCA algorithm called History PCA that achieves this goal. By using $O(Bd)$ memory with $B\approx 10$ being the block size, our algorithm converges much faster than existing streaming PCA algorithms. By changing the number of inner iterations, the memory usage can be further reduced to $O(d)$ while maintaining a comparable convergence speed. We provide theoretical guarantees for the convergence of our algorithm along with the rate of convergence. We also demonstrate on synthetic and real world data sets that our algorithm compares favorably with other state-of-the-art streaming PCA methods in terms of the convergence speed and performance.
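
The abstract contrasts History PCA with streaming methods that update the model from the incoming sample only and then discard it. As a rough illustration of that baseline (not of History PCA itself, whose update is not given in the abstract), the sketch below uses Oja's rule for the top component ($k = 1$) with $O(d)$ memory. The function names, the step size, and the synthetic data are assumptions made for the example.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def oja_update(w, x, step):
    """One Oja step: move w towards x in proportion to the projection x.w."""
    proj = sum(wi * xi for wi, xi in zip(w, x))
    return normalize([wi + step * proj * xi for wi, xi in zip(w, x)])

def stream_top_component(samples, dim, step=0.01, seed=0):
    """Estimate the leading principal direction from a stream of samples.

    Only the current estimate w is stored, so memory is O(dim); each
    sample is used once and then dropped. The data are assumed centered.
    """
    rng = random.Random(seed)
    w = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    for x in samples:
        w = oja_update(w, x, step)
    return w

if __name__ == "__main__":
    # Hypothetical 2-d stream with most variance along the first axis.
    rng = random.Random(1)
    data = [(3.0 * rng.gauss(0, 1), 0.3 * rng.gauss(0, 1)) for _ in range(3000)]
    print(stream_top_component(data, dim=2))
```

The estimate is defined only up to sign, so a comparison with a known direction should use the absolute value of the inner product.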

[9]  arXiv:1802.05451 [pdf, other]
Title: Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

Structured prediction is concerned with predicting multiple inter-dependent labels simultaneously. Classical methods like CRF achieve this by maximizing a score function over the set of possible label assignments. Recent extensions use neural networks either to implement the score function or to carry out the maximization. The current paper takes an alternative approach, using a neural network to generate the structured output directly, without going through a score function. We take an axiomatic perspective to derive the desired properties and invariances of such a network to certain input permutations, presenting a structural characterization that is provably both necessary and sufficient. We then discuss graph-permutation invariant (GPI) architectures that satisfy this characterization and explain how they can be used for deep structured prediction. We evaluate our approach on the challenging problem of inferring a scene graph from an image, namely, predicting entities and their relations in the image. We obtain state-of-the-art results on the challenging Visual Genome benchmark, outperforming all recent approaches.

[10]  arXiv:1802.05475 [pdf, ps, other]
Title: Robust and sparse Gaussian graphical modeling under cell-wise contamination
Subjects: Methodology (stat.ME)

Graphical modeling explores dependences among a collection of variables by inferring a graph that encodes pairwise conditional independences. For jointly Gaussian variables, this translates into detecting the support of the precision matrix. Many modern applications feature high-dimensional and contaminated data that complicate this task. In particular, traditional robust methods that down-weight entire observation vectors are often inappropriate as high-dimensional data may feature partial contamination in many observations. We tackle this problem by giving a robust method for sparse precision matrix estimation based on the $\gamma$-divergence under a cell-wise contamination model. Simulation studies demonstrate that our procedure outperforms existing methods especially for highly contaminated data.

[11]  arXiv:1802.05495 [pdf, other]
Title: An Operational (Preasymptotic) Measure of Fat-tailedness
Subjects: Methodology (stat.ME); Statistical Finance (q-fin.ST)

This note presents an operational measure of fat-tailedness for univariate probability distributions, in $[0,1]$ where 0 is maximally thin-tailed (Gaussian) and 1 is maximally fat-tailed.
Among other things, 1) it helps assess the sample size needed to establish a comparative $n$ needed for statistical significance, 2) allows practical comparisons across classes of fat-tailed distributions, 3) helps understand some inconsistent attributes of the lognormal, depending on the parametrization of its scale parameter.
The literature is rich for what concerns asymptotic behavior, but there is a large void for finite values of $n$, those needed for operational purposes. Conventional measures of fat-tailedness, namely 1) the tail index for the power law class, and 2) Kurtosis for finite moment distributions fail to apply to some distributions, and do not allow comparisons across classes and parametrization, that is between power laws outside the Levy-Stable basin, or power laws to distributions in other classes, or power laws for different number of summands. How can one compare a sum of 100 Student T distributed random variables with 3 degrees of freedom to one in a Levy-Stable or a Lognormal class? How can one compare a sum of 100 Student T with 3 degrees of freedom to a single Student T with 2 degrees of freedom?
We propose an operational and heuristic measure that allows us to compare $n$-summed independent variables under all distributions with finite first moment. The method is based on the rate of convergence of the law of large numbers for finite sums, $n$-summands specifically.
We get either explicit expressions or simulation results and bounds for the lognormal, exponential, Pareto, and the Student T distributions in their various calibrations, in addition to the general Pearson classes.

[12]  arXiv:1802.05530 [pdf, other]
Title: Modelling spatial heterogeneity and discontinuities using Voronoi tessellations
Subjects: Methodology (stat.ME)

Many methods for modelling spatial processes assume global smoothness properties; such assumptions are often violated in practice. We introduce a method for modelling spatial processes that display heterogeneity or contain discontinuities. The problem of non-stationarity is dealt with by using a combination of Voronoi tessellation to partition the input space, and a separate Gaussian process to model the data on each region of the partitioned space. Our method is highly flexible because we allow the Voronoi cells to form relationships with each other, which can enable non-convex and disconnected regions to be considered. In such problems, identifying the borders between regions is often of great importance and we propose an adaptive sampling method to gain extra information along such borders. The method is illustrated with simulation studies and application to real data.

[13]  arXiv:1802.05550 [pdf, other]
Title: ICA based on Split Generalized Gaussian
Comments: arXiv admin note: substantial text overlap with arXiv:1701.09160
Subjects: Machine Learning (stat.ML)

Independent Component Analysis (ICA) - one of the basic tools in data analysis - aims to find a coordinate system in which the components of the data are independent. The most popular ICA methods, such as FastICA and JADE, maximize kurtosis as a metric of non-Gaussianity. However, their assumption of fourth-order moment (kurtosis) may not always be satisfied in practice. One possible solution is to use the third-order moment (skewness) instead of kurtosis, which was applied in $ICA_{SG}$ and EcoICA.
In this paper we present a competitive approach to ICA based on the Split Generalized Gaussian distribution (SGGD), which is well adapted to heavy-tailed as well as asymmetric data. Consequently, we obtain a method which works better than the classical approaches in both cases: heavy tails and non-symmetric data.

[14]  arXiv:1802.05570 [pdf, other]
Title: Optimal Transport: Fast Probabilistic Approximation with Exact Solvers
Subjects: Computation (stat.CO); Methodology (stat.ME)

We propose a simple subsampling scheme for fast randomized approximate computation of optimal transport distances. This scheme operates on a random subset of the full data and can use any exact algorithm as a black-box back-end, including state-of-the-art solvers and entropically penalized versions. It is based on averaging the exact distances between empirical measures generated from independent samples from the original measures and can easily be tuned towards higher accuracy or shorter computation times. To this end, we give non-asymptotic deviation bounds for its accuracy. In particular, we show that in many important cases, including images, the approximation error is independent of the size of the full problem. We present numerical experiments that demonstrate that a very good approximation in typical applications can be obtained in a computation time that is several orders of magnitude smaller than what is required for exact computation of the full problem.
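
The scheme described above (draw independent samples from each measure, solve the small problem exactly, average the results) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the measures are uniform empirical measures on the real line, where the exact optimal transport cost between equal-size samples reduces to matching sorted values, and the names `wasserstein_1d` and `subsampled_ot` are invented for the example. Any other exact solver with the same call shape could be passed as `solver`.

```python
import random

def wasserstein_1d(xs, ys, p=1):
    """Exact p-Wasserstein distance between two equal-size empirical
    measures on the line (sorted values are matched in order)."""
    if len(xs) != len(ys):
        raise ValueError("samples must have equal size")
    xs, ys = sorted(xs), sorted(ys)
    cost = sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)

def subsampled_ot(xs, ys, sample_size, repeats, solver=wasserstein_1d, seed=0):
    """Average the exact distance over independent random subsamples.

    Each repeat draws sample_size points i.i.d. (with replacement) from
    each empirical measure and calls the black-box exact solver on them.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        sub_x = rng.choices(xs, k=sample_size)
        sub_y = rng.choices(ys, k=sample_size)
        total += solver(sub_x, sub_y)
    return total / repeats

if __name__ == "__main__":
    # Hypothetical data: the second measure is the first shifted by 500,
    # so the exact 1-Wasserstein distance on the full data is 500.
    xs = [float(i) for i in range(1000)]
    ys = [x + 500.0 for x in xs]
    print(wasserstein_1d(xs, ys))
    print(subsampled_ot(xs, ys, sample_size=200, repeats=20))
```

Raising `sample_size` or `repeats` trades computation time for accuracy, which is the tuning knob the abstract refers to.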

[15]  arXiv:1802.05584 [pdf, other]
Title: Convolutional Analysis Operator Learning: Acceleration, Convergence, Application, and Neural Networks
Comments: 19 pages, 9 figures
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)

Convolutional operator learning is increasingly gaining attention in many signal processing and computer vision applications. Learning kernels has mostly relied on so-called local approaches that extract and store many overlapping patches across training signals. Due to memory demands, local approaches have limitations when learning kernels from large datasets -- particularly with multi-layered structures, e.g., convolutional neural network (CNN) -- and/or applying the learned kernels to high-dimensional signal recovery problems. The so-called global approach has been studied within the "synthesis" signal model, e.g., convolutional dictionary learning, overcoming the memory problems by careful algorithmic designs. This paper proposes a new convolutional analysis operator learning (CAOL) framework in the global approach, and develops a new convergent Block Proximal Gradient method using a Majorizer (BPG-M) to solve the corresponding block multi-nonconvex problems. To learn diverse filters within the CAOL framework, this paper introduces an orthogonality constraint that enforces a tight-frame (TF) filter condition, and a regularizer that promotes diversity between filters. Numerical experiments show that, for tight majorizers, BPG-M significantly accelerates the CAOL convergence rate compared to the state-of-the-art method, BPG. Numerical experiments for sparse-view computational tomography show that CAOL using TF filters significantly improves reconstruction quality compared to a conventional edge-preserving regularizer. Finally, this paper shows that CAOL can be useful to mathematically model a CNN, and the corresponding updates obtained via BPG-M coincide with core modules of the CNN.

[16]  arXiv:1802.05622 [pdf, other]
Title: Conditioning of three-dimensional generative adversarial networks for pore and reservoir-scale models
Comments: 5 pages, 2 figures
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)

Geostatistical modeling of petrophysical properties is a key step in modern integrated oil and gas reservoir studies. Recently, generative adversarial networks (GAN) have been shown to be a successful method for generating unconditional simulations of pore- and reservoir-scale models. This contribution leverages the differentiable nature of neural networks to extend GANs to the conditional simulation of three-dimensional pore- and reservoir-scale models. Based on the previous work of Yeh et al. (2016), we use a content loss to constrain to the conditioning data and a perceptual loss obtained from the evaluation of the GAN discriminator network. The technique is tested on the generation of three-dimensional micro-CT images of a Ketton limestone constrained by two-dimensional cross-sections, and on the simulation of the Maules Creek alluvial aquifer constrained by one-dimensional sections. Our results show that GANs represent a powerful method for sampling conditioned pore and reservoir samples for stochastic reservoir evaluation workflows.

[17]  arXiv:1802.05631 [pdf, other]
Title: Direct Estimation of Differences in Causal Graphs
Subjects: Methodology (stat.ME)

We consider the problem of estimating the differences between two causal directed acyclic graph (DAG) models given i.i.d. samples from each model. This is of interest for example in genomics, where large-scale gene expression data is becoming available under different cellular contexts, for different cell types, or disease states. Changes in the structure or edge weights of the underlying causal graphs reflect alterations in the gene regulatory networks and provide important insights into the emergence of a particular phenotype. While the individual networks are usually very large, containing high-degree hub nodes and thus difficult to learn, the overall change between two related networks can be sparse. We here provide the first provably consistent method for directly estimating the differences in a pair of causal DAGs without separately learning two possibly large and dense DAG models and computing their difference. Our two-step algorithm first uses invariance tests between regression coefficients of the two data sets to estimate the skeleton of the difference graph and then orients some of the edges using invariance tests between regression residual variances. We demonstrate the properties of our method through a simulation study and apply it to the analysis of gene expression data from ovarian cancer and during T-cell activation.

[18]  arXiv:1802.05635 [pdf, ps, other]
Title: Nonparametric Bayesian posterior contraction rates for scalar diffusions with high-frequency data
Authors: Kweku Abraham
Subjects: Statistics Theory (math.ST)

We consider inference in the scalar diffusion model $dX_t=b(X_t)dt+\sigma(X_t)dW_t$ with discrete data $(X_{j\Delta_n})_{0\leq j \leq n}$, $n\to \infty,~\Delta_n\to 0$ and periodic coefficients. For $\sigma$ given, we prove a general theorem detailing conditions under which Bayesian posteriors will contract in $L^2$-distance around the true drift function $b_0$ at the frequentist minimax rate (up to logarithmic factors) over Besov smoothness classes. We exhibit natural nonparametric priors which satisfy our conditions. Our results show that the Bayesian method adapts both to an unknown sampling regime and to unknown smoothness.
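
To make the observation scheme concrete, the sketch below generates discrete data $(X_{j\Delta_n})_{0\leq j \leq n}$ from $dX_t=b(X_t)dt+\sigma(X_t)dW_t$ with the Euler-Maruyama scheme. This only approximates the diffusion and is not part of the paper's inference method; the particular periodic coefficients in the usage example are hypothetical choices.

```python
import math
import random

def simulate_diffusion(b, sigma, x0, n, delta, seed=0):
    """Euler-Maruyama approximation of dX = b(X) dt + sigma(X) dW.

    Returns the n + 1 values X_0, X_delta, ..., X_{n * delta}.
    """
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n):
        x = path[-1]
        # Brownian increment over a step of length delta: N(0, delta).
        dw = rng.gauss(0.0, math.sqrt(delta))
        path.append(x + b(x) * delta + sigma(x) * dw)
    return path

if __name__ == "__main__":
    # Hypothetical 1-periodic drift and volatility; sigma stays positive.
    drift = lambda x: math.sin(2 * math.pi * x)
    vol = lambda x: 1.0 + 0.5 * math.cos(2 * math.pi * x)
    data = simulate_diffusion(drift, vol, x0=0.0, n=1000, delta=0.001)
    print(len(data), data[-1])
```

In the high-frequency regime of the abstract, one would let `n` grow while `delta` shrinks.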

[19]  arXiv:1802.05650 [pdf, other]
Title: Ranks and Pseudo-Ranks - Paradoxical Results of Rank Tests -
Comments: 19 pages, 0 figures
Subjects: Statistics Theory (math.ST)

Rank-based inference methods are applied in various disciplines, typically when procedures relying on standard normal theory are not justifiable, for example when data are not symmetrically distributed, contain outliers, or responses are even measured on ordinal scales. Various specific rank-based methods have been developed for two and more samples, and also for general factorial designs (e.g., Kruskal-Wallis test, Jonckheere-Terpstra test). It is the aim of the present paper (1) to demonstrate that traditional rank-procedures for several samples or general factorial designs may lead to paradoxical results in case of unbalanced samples, (2) to explain why this is the case, and (3) to provide a way to overcome these disadvantages of traditional rank-based inference. Theoretical investigations show that the paradoxical results can be explained by carefully considering the non-centralities of the test statistics which may be non-zero for the traditional tests in unbalanced designs. These non-centralities may even become arbitrarily large for increasing sample sizes in the unbalanced case. A simple solution is the use of so-called pseudo-ranks instead of ranks. As a special case, we illustrate the effects in sub-group analyses which are often used when dealing with rare diseases.

[20]  arXiv:1802.05664 [pdf, other]
Title: DeepMatch: Balancing Deep Covariate Representations for Causal Inference Using Adversarial Training
Authors: Nathan Kallus
Subjects: Machine Learning (stat.ML)

We study optimal covariate balance for causal inferences from observational data when rich covariates and complex relationships necessitate flexible modeling with neural networks. Standard approaches such as propensity weighting and matching/balancing fail in such settings due to miscalibrated propensity nets and inappropriate covariate representations, respectively. We propose a new method based on adversarial training of a weighting and a discriminator network that effectively addresses this methodological gap. This is demonstrated through new theoretical characterizations of the method as well as empirical results using both fully connected architectures to learn complex relationships and convolutional architectures to handle image confounders, showing how this new method can enable strong causal analyses in these challenging settings.

[21]  arXiv:1802.05680 [pdf, other]
Title: Constraining the Dynamics of Deep Probabilistic Models
Comments: 12 pages
Subjects: Machine Learning (stat.ML)

We introduce a novel generative formulation of deep probabilistic models implementing "soft" constraints on the dynamics of the functions they can model. In particular we develop a flexible methodological framework where the modeled functions and derivatives of a given order are subject to inequality or equality constraints. We characterize the posterior distribution over model and constraint parameters through stochastic variational inference techniques. As a result, the proposed approach allows for accurate and scalable uncertainty quantification of predictions and parameters. We demonstrate the application of equality constraints in the challenging problem of parameter inference in ordinary differential equation models, while we showcase the application of inequality constraints on monotonic regression on count data. The proposed approach is extensively tested in several experimental settings, leading to highly competitive results in challenging modeling applications, while offering high expressiveness, flexibility and scalability.

[22]  arXiv:1802.05688 [pdf, other]
Title: Simulation assisted machine learning
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Quantitative Methods (q-bio.QM)

Predicting how a proposed cancer treatment will affect a given tumor can be cast as a machine learning problem, but the complexity of biological systems, the number of potentially relevant genomic and clinical features, and the lack of very large scale patient data repositories make this a unique challenge. "Pure data" approaches to this problem are underpowered to detect combinatorially complex interactions and are bound to uncover false correlations despite statistical precautions taken (1). To investigate this setting, we propose a method to integrate simulations, a strong form of prior knowledge, into machine learning, a combination which to date has been largely unexplored. The results of multiple simulations (under various uncertainty scenarios) are used to compute similarity measures between every pair of samples: sample pairs are given a high similarity score if they behave similarly under a wide range of simulation parameters. These similarity values, rather than the original high dimensional feature data, are used to train kernelized machine learning algorithms such as support vector machines, thus handling the curse-of-dimensionality that typically affects genomic machine learning. Using four synthetic datasets of complex systems--three biological models and one network flow optimization model--we demonstrate that when the number of training samples is small compared to the number of features, the simulation kernel approach dominates over no-prior-knowledge methods. In addition to biology and medicine, this approach should be applicable to other disciplines, such as weather forecasting, financial markets, and agricultural management, where predictive models are sought and informative yet approximate simulations are available. The Python SimKern software, the models (in MATLAB, Octave, and R), and the datasets are made freely available at https://github.com/davidcraft/SimKern.

Cross-lists for Fri, 16 Feb 18

[23]  arXiv:1705.01166 (cross-list from physics.data-an) [pdf, other]
Title: Maximizing the information learned from finite data selects a simple model
Comments: 9 pages, 8 figures. v3 has improved discussion and adds an appendix about MDL and Bayes factors, and matches version to appear in PNAS (modulo comma placement). Title changed from "Rational Ignorance: Simpler Models Learn More Information from Finite Data"
Journal-ref: PNAS February 2018
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)

We use the language of uninformative Bayesian prior choice to study the selection of appropriately simple effective models. We advocate for the prior which maximizes the mutual information between parameters and predictions, learning as much as possible from limited data. When many parameters are poorly constrained by the available data, we find that this prior puts weight only on boundaries of the parameter manifold. Thus it selects a lower-dimensional effective theory in a principled way, ignoring irrelevant parameter directions. In the limit where there is sufficient data to tightly constrain any number of parameters, this reduces to Jeffreys prior. But we argue that this limit is pathological when applied to the hyper-ribbon parameter manifolds generic in science, because it leads to dramatic dependence on effects invisible to experiment.

[24]  arXiv:1802.05312 (cross-list from cs.LG) [pdf, other]
Title: Learning Deep Disentangled Embeddings with the F-Statistic Loss
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Deep-embedding methods aim to discover representations of a domain that make explicit the domain's class structure. Disentangling methods aim to make explicit compositional or factorial structure. We combine these two active but independent lines of research and propose a new paradigm for discovering disentangled representations of class structure; these representations reveal the underlying factors that jointly determine class. We propose and evaluate a novel loss function based on the $F$ statistic, which describes the separation of two or more distributions. By ensuring that distinct classes are well separated on a subset of embedding dimensions, we obtain embeddings that are useful for few-shot learning. By not requiring separation on all dimensions, we encourage the discovery of disentangled representations. Our embedding procedure matches or beats state-of-the-art procedures on deep embeddings, as evaluated by performance on recall@$k$ and few-shot learning tasks. To evaluate alternative approaches on disentangling, we formalize two key properties of a disentangled representation: modularity and explicitness. By these criteria, our procedure yields disentangled representations, whereas traditional procedures fail. The goal of our work is to obtain more interpretable, manipulable, and generalizable deep representations of concepts and categories.
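
For reference, the classical one-way $F$ statistic that measures the separation of two or more groups along a single dimension can be computed as below. This is the textbook ANOVA quantity only; how the paper turns it into a loss over embedding dimensions is not specified in the abstract, and the function name `f_statistic` is an assumption made for the example.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of scalar values.

    F = (between-group mean square) / (within-group mean square);
    larger values indicate better separated group means.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more values than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, means)) / (k - 1)
    within = sum((x - m) ** 2
                 for g, m in zip(groups, means) for x in g) / (n - k)
    if within == 0.0:
        return float("inf")  # groups have no internal spread
    return between / within

if __name__ == "__main__":
    # Hypothetical one-dimensional embedding values for two classes.
    print(f_statistic([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]))
    print(f_statistic([[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]))
```

The second call, with class means far apart relative to the within-class spread, returns a much larger value than the first.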

[25]  arXiv:1802.05313 (cross-list from cs.AI) [pdf, other]
Title: Reinforcement Learning from Imperfect Demonstrations
Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG); Machine Learning (stat.ML)

Robust real-world learning should benefit from both demonstrations and interactions with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on the reward received from the environment. These tasks have divergent losses which are difficult to jointly optimize and such methods can be very sensitive to noisy demonstrations. We propose a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. NAC learns an initial policy network from demonstrations and refines the policy in the environment, surpassing the demonstrator's performance. Crucially, both learning from demonstration and interactive refinement use the same objective, unlike prior approaches that combine distinct supervised and reinforcement losses. This makes NAC robust to suboptimal demonstration data since the method is not forced to mimic all of the examples in the dataset. We show that our unified reinforcement learning algorithm can learn robustly and outperform existing baselines when evaluated on several realistic driving games.

[26]  arXiv:1802.05315 (cross-list from cs.LG) [pdf, other]
Title: Online Learning for Non-Stationary A/B Tests
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

The rollout of new versions of a feature in modern applications is a manual multi-stage process, as the feature is released to ever larger groups of users, while its performance is carefully monitored. This kind of A/B testing is ubiquitous, but suboptimal, as the monitoring requires heavy human intervention, is not guaranteed to capture consistent, but short-term fluctuations in performance, and is inefficient, as better versions take a long time to reach the full population.
In this work we formulate this question as that of expert learning, and give a new algorithm, Follow-The-Best-Interval (FTBI), that works in dynamic, non-stationary environments. Our approach is practical, simple, and efficient, and has rigorous guarantees on its performance. Finally, we perform a thorough evaluation on synthetic and real world datasets and show that our approach outperforms current state-of-the-art methods.

[27]  arXiv:1802.05319 (cross-list from cs.SE) [pdf, other]
Title: 500+ Times Faster Than Deep Learning (A Case Study Exploring Faster Methods for Text Mining StackOverflow)
Subjects: Software Engineering (cs.SE); Learning (cs.LG); Machine Learning (stat.ML)

Deep learning methods are useful for high-dimensional data and are becoming widely used in many areas of software engineering. Deep learners utilize extensive computational power and can take a long time to train, making it difficult to widely validate, repeat, and improve their results. Further, they are not the best solution in all domains. For example, recent results show that for finding related Stack Overflow posts, a tuned SVM performs similarly to a deep learner, but is significantly faster to train. This paper extends that recent result by clustering the dataset, then tuning learners within each cluster. This approach is over 500 times faster than deep learning (and over 900 times faster if we use all the cores on a standard laptop computer). Significantly, this faster approach generates classifiers nearly as good (within 2% F1 score) as the much slower deep learning method. Hence we recommend this faster method since it is much easier to reproduce and utilizes far fewer CPU resources. More generally, we recommend that before researchers release research results, they compare their supposedly sophisticated methods against simpler alternatives (e.g., applying simpler learners to build local models).

[28]  arXiv:1802.05333 (cross-list from econ.EM) [pdf, ps, other]
Title: Bootstrap-Assisted Unit Root Testing With Piecewise Locally Stationary Errors
Comments: This paper has been accepted for publication and will appear in a revised form, subsequent to editorial input by Cambridge University Press, in Econometric Theory
Subjects: Econometrics (econ.EM); Statistics Theory (math.ST)

In unit root testing, a piecewise locally stationary process is adopted to accommodate nonstationary errors that can have both smooth and abrupt changes in second- or higher-order properties. Under this framework, the limiting null distributions of the conventional unit root test statistics are derived and shown to contain a number of unknown parameters. To circumvent the difficulty of direct consistent estimation, we propose to use the dependent wild bootstrap to approximate the non-pivotal limiting null distributions and provide a rigorous theoretical justification for bootstrap consistency. The proposed method is compared through finite sample simulations with the recolored wild bootstrap procedure, which was developed for errors that follow a heteroscedastic linear process. Further, a combination of autoregressive sieve recoloring with the dependent wild bootstrap is shown to perform well. The validity of the dependent wild bootstrap in a nonstationary setting is demonstrated for the first time, showing the possibility of extensions to other inference problems associated with locally stationary processes.

[29]  arXiv:1802.05335 (cross-list from cs.LG) [pdf, other]
Title: Multimodal Generative Models for Scalable Weakly-Supervised Learning
Authors: Mike Wu, Noah Goodman
Comments: 8 pages, 10 figures
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Multiple modalities often co-occur when describing natural phenomena. Learning a joint representation of these modalities should yield deeper and more useful representations. Previous work has proposed generative models to handle multi-modal input. However, these models either do not learn a joint distribution or require complex additional computations to handle missing data. Here, we introduce a multimodal variational autoencoder that uses a product-of-experts inference network and a sub-sampled training paradigm to solve the multi-modal inference problem. Notably, our model shares parameters to efficiently learn under any combination of missing modalities, thereby enabling weakly-supervised learning. We apply our method on four datasets and show that we match state-of-the-art performance using many fewer parameters. In each case our approach yields strong weakly-supervised results. We then consider a case study of learning image transformations---edge detection, colorization, facial landmark segmentation, etc.---as a set of modalities. We find appealing results across this range of tasks.

[30]  arXiv:1802.05339 (cross-list from physics.data-an) [pdf, other]
Title: Two- and Multi-dimensional Curve Fitting using Bayesian Inference
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Instrumentation and Methods for Astrophysics (astro-ph.IM); Statistics Theory (math.ST)

Fitting models to data using Bayesian inference is quite common, but when each point in parameter space gives a curve, fitting the curve to a data set requires new nuisance parameters, which specify the metric embedding the one-dimensional curve into the higher-dimensional space occupied by the data. A generic formalism for curve fitting in the context of Bayesian inference is developed which shows how the aforementioned metric arises. The result is a natural generalization of previous works, and is compared to oft-used frequentist approaches and similar Bayesian techniques.

[31]  arXiv:1802.05351 (cross-list from cs.CR) [pdf, other]
Title: Stealing Hyperparameters in Machine Learning
Comments: To appear in the Proceedings of the IEEE Symposium on Security and Privacy, May 2018
Subjects: Cryptography and Security (cs.CR); Learning (cs.LG); Machine Learning (stat.ML)

Hyperparameters are critical in machine learning, as different hyperparameters often result in models with significantly different performance. Hyperparameters may be deemed confidential because of their commercial value and the confidentiality of the proprietary algorithms that the learner uses to learn them. In this work, we propose attacks on stealing the hyperparameters that are learned by a learner. We call our attacks hyperparameter stealing attacks. Our attacks are applicable to a variety of popular machine learning algorithms such as ridge regression, logistic regression, support vector machine, and neural network. We evaluate the effectiveness of our attacks both theoretically and empirically. For instance, we evaluate our attacks on Amazon Machine Learning. Our results demonstrate that our attacks can accurately steal hyperparameters. We also study countermeasures. Our results highlight the need for new defenses against our hyperparameter stealing attacks for certain machine learning algorithms.

[32]  arXiv:1802.05374 (cross-list from math.OC) [pdf, other]
Title: A Progressive Batching L-BFGS Method for Machine Learning
Comments: 25 pages, 17 figures, 2 tables
Subjects: Optimization and Control (math.OC); Learning (cs.LG); Machine Learning (stat.ML)

The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise to faster algorithms with better generalization properties, L-BFGS is currently not considered an algorithm of choice for large-scale machine learning applications. One need not, however, choose between the two extremes represented by the full batch or highly stochastic regimes, and may instead follow a progressive batching approach in which the sample size increases during the course of the optimization. In this paper, we present a new version of the L-BFGS algorithm that combines three basic components - progressive batching, a stochastic line search, and stable quasi-Newton updating - and that performs well on training logistic regression and deep neural networks. We provide supporting convergence theory for the method.

[33]  arXiv:1802.05380 (cross-list from cs.LG) [pdf, other]
Title: Active Feature Acquisition with Supervised Matrix Completion
Comments: 9 pages, 8 figures
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Missing features are a serious problem in many applications, which may lead to low-quality training data and significantly degrade learning performance. Because feature acquisition usually involves special devices or complex processes, it is expensive to acquire all feature values for the whole dataset. On the other hand, features may be correlated with each other, and some values may be recovered from the others. It is thus important to decide which features are most informative for recovering the other features as well as improving the learning performance. In this paper, we try to train an effective classification model with the least acquisition cost by jointly performing active feature querying and supervised matrix completion. When completing the feature matrix, a novel target function is proposed to simultaneously minimize the reconstruction error on observed entries and the supervised loss on training data. When querying the feature value, the most uncertain entry is actively selected based on the variance of previous iterations. In addition, a bi-objective optimization method is presented for cost-aware active selection when features bear different acquisition costs. The effectiveness of the proposed approach is well validated by both theoretical analysis and experimental study.

[34]  arXiv:1802.05386 (cross-list from cs.LG) [pdf]
Title: Shamap: Shape-based Manifold Learning
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

For manifold learning, it is assumed that high-dimensional sample/data points are on an embedded low-dimensional manifold. Usually, distances among samples are computed to represent the underlying data structure, for a specified distance measure such as the Euclidean distance or geodesic distance. For manifold learning, here we propose a metric according to the angular change along a geodesic line, thereby reflecting the underlying shape-oriented information or the similarity between high- and low-dimensional representations of a data cloud. Our numerical results are described to demonstrate the feasibility and merits of the proposed dimensionality reduction scheme.

[35]  arXiv:1802.05392 (cross-list from cs.LG) [pdf, other]
Title: Reducing over-clustering via the powered Chinese restaurant process
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Dirichlet process mixture (DPM) models tend to produce many small clusters regardless of whether they are needed to accurately characterize the data; this is particularly true for large data sets. However, interpretability, parsimony, data storage and communication costs are all hampered by having too many clusters. We propose a powered Chinese restaurant process to limit this kind of problem and penalize over-clustering. The method is illustrated using some simulation examples and data with large and small sample sizes, including MNIST and the Old Faithful Geyser data.
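
The abstract does not give the exact form of the powered process, so the sketch below is only one plausible reading. It simulates a Chinese restaurant process in which each existing table is weighted by its size raised to a power, while a new table keeps weight `alpha`. The function name `powered_crp` and the parameters `alpha`, `power` and `seed` are assumptions for illustration; with `power=1.0` the code reduces to the standard Chinese restaurant process.

```python
import random

def powered_crp(n_customers, alpha=1.0, power=1.0, seed=0):
    """Simulate table sizes under an assumed 'powered' seating rule.

    Existing table k is chosen with weight sizes[k] ** power and a new
    table with weight alpha. This weighting is an assumption, not taken
    from the abstract; power=1.0 is the standard CRP.
    """
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_customers):
        weights = [s ** power for s in sizes] + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(sizes):
            sizes.append(1)        # open a new table
        else:
            sizes[choice] += 1     # join an existing table
    return sizes

standard = powered_crp(500, alpha=1.0, power=1.0)
powered = powered_crp(500, alpha=1.0, power=1.2)
print(len(standard), len(powered))
```

Under this assumed rule, a power above 1 gives large tables extra weight, which is one way a process could discourage many small clusters.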

[36]  arXiv:1802.05394 (cross-list from cs.LG) [pdf, other]
Title: Cost-Effective Training of Deep CNNs with Active Model Adaptation
Comments: 9 pages
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Deep convolutional neural networks have achieved great success in various applications. However, training an effective DNN model for a specific task is rather challenging because it requires prior knowledge or experience to design the network architecture, a repeated trial-and-error process to tune the parameters, and a large set of labeled data to train the model. In this paper, we propose to overcome these challenges by actively adapting a pre-trained model to a new task with fewer labeled examples. Specifically, the pre-trained model is iteratively fine-tuned based on the most useful examples. The examples are actively selected based on a novel criterion, which jointly estimates the potential contribution of an instance to optimizing the feature representation as well as improving the classification model for the target task. On one hand, the pre-trained model brings plentiful information from its original task, avoiding redesign of the network architecture or training from scratch; on the other hand, the labeling cost can be significantly reduced by active label querying. Experiments on multiple datasets and different pre-trained models demonstrate that the proposed approach can achieve cost-effective training of DNNs.

[37]  arXiv:1802.05408 (cross-list from cs.IT) [pdf, ps, other]
Title: "Dependency Bottleneck" in Auto-encoding Architectures: an Empirical Study
Subjects: Information Theory (cs.IT); Learning (cs.LG); Machine Learning (stat.ML)

Recent works investigated the generalization properties in deep neural networks (DNNs) by studying the Information Bottleneck in DNNs. However, the measurement of the mutual information (MI) is often inaccurate due to the density estimation. To address this issue, we propose to measure the dependency instead of MI between layers in DNNs. Specifically, we propose to use the Hilbert-Schmidt Independence Criterion (HSIC) as the dependency measure, which can measure the dependence of two random variables without estimating probability densities. Moreover, HSIC is a special case of the Squared-loss Mutual Information (SMI). In the experiment, we empirically evaluate the generalization property using HSIC in both the reconstruction and prediction auto-encoding (AE) architectures.
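
To make the density-free nature of HSIC concrete, here is a minimal sketch of the standard biased empirical estimator, $\mathrm{tr}(KHLH)/(n-1)^2$, for scalar samples with Gaussian kernels. The kernel choice, the bandwidth `sigma` and the helper names are assumptions for this example; the abstract does not say which estimator or kernels the paper uses.

```python
import math

def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian kernel on scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in xs] for a in xs]

def center(gram):
    """Return H K H, where H = I - (1/n) 1 1^T is the centering matrix."""
    n = len(gram)
    row = [sum(r) / n for r in gram]
    col = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[gram[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = len(xs)
    kc = center(gaussian_gram(xs, sigma))
    l = gaussian_gram(ys, sigma)
    # tr(K H L H) equals tr((H K H) L) by cyclicity of the trace.
    trace = sum(kc[i][j] * l[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2

xs = [i / 10 for i in range(20)]
dependent = hsic(xs, xs)               # y is a copy of x
constant = hsic(xs, [0.0] * len(xs))   # y carries no information about x
print(dependent, constant)
```

Only kernel evaluations on the samples are needed; no probability density is estimated at any point.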

[38]  arXiv:1802.05411 (cross-list from cs.LG) [pdf, ps, other]
Title: Selecting the Best in GANs Family: a Post Selection Inference Framework
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

"Which Generative Adversarial Network (GAN) generates the most plausible images?" has been a frequently asked question among researchers. To address this problem, we first propose an incomplete U-statistics estimate of maximum mean discrepancy $\mathrm{MMD}_{inc}$ to measure the distribution discrepancy between generated and real images. $\mathrm{MMD}_{inc}$ enjoys the advantages of asymptotic normality, computational efficiency, and model agnosticity. We then propose a GANs analysis framework to select and test the "best" member in the GANs family using Post Selection Inference (PSI) with $\mathrm{MMD}_{inc}$. In the experiments, we adopt the proposed framework on 7 GANs variants and compare their $\mathrm{MMD}_{inc}$ scores.

[39]  arXiv:1802.05429 (cross-list from cs.SD) [pdf, ps, other]
Title: Blind Source Separation with Optimal Transport Non-negative Matrix Factorization
Comments: 22 pages, 7 figures, 2 additional files
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)

Optimal transport as a loss for machine learning optimization problems has recently gained a lot of attention. Building upon recent advances in computational optimal transport, we develop an optimal transport non-negative matrix factorization (NMF) algorithm for supervised speech blind source separation (BSS). Optimal transport allows us to design and leverage a cost between short-time Fourier transform (STFT) spectrogram frequencies, which takes into account how humans perceive sound. We give empirical evidence that using our proposed optimal transport NMF leads to perceptually better results than Euclidean NMF, for both isolated voice reconstruction and BSS tasks. Finally, we demonstrate how to use optimal transport for cross domain sound processing tasks, where frequencies represented in the input spectrograms may be different from one spectrogram to another.

[40]  arXiv:1802.05472 (cross-list from cs.LG) [pdf]
Title: Admissible Time Series Motif Discovery with Missing Data
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

The discovery of time series motifs has emerged as one of the most useful primitives in time series data mining. Researchers have shown its utility for exploratory data mining, summarization, visualization, segmentation, classification, clustering, and rule discovery. Although there has been more than a decade of extensive research, there is still no technique to allow the discovery of time series motifs in the presence of missing data, despite the well-documented ubiquity of missing data in scientific, industrial, and medical datasets. In this work, we introduce a technique for motif discovery in the presence of missing data. We formally prove that our method is admissible, producing no false negatives. We also show that our method can piggy-back off the fastest known motif discovery method with a small constant factor time/space overhead. We demonstrate our approach on diverse datasets with varying amounts of missing data.

[41]  arXiv:1802.05637 (cross-list from cs.LG) [pdf, other]
Title: cGANs with Projection Discriminator
Comments: Published as a conference paper at ICLR 2018
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

We propose a novel, projection-based way to incorporate the conditional information into the discriminator of GANs that respects the role of the conditional information in the underlying probabilistic model. This approach is in contrast with most frameworks of conditional GANs used in application today, which use the conditional information by concatenating the (embedded) conditional vector to the feature vectors. With this modification, we were able to significantly improve the quality of the class conditional image generation on the ILSVRC2012 (ImageNet) 1000-class image dataset from the current state-of-the-art result, and we achieved this with a single pair of a discriminator and a generator. We were also able to extend the application to super-resolution and succeeded in producing highly discriminative super-resolution images. This new structure also enabled high quality category transformation based on parametric functional transformation of conditional batch normalization layers in the generator.

[42]  arXiv:1802.05666 (cross-list from cs.LG) [pdf, ps, other]
Title: Adversarial Risk and the Dangers of Evaluating Against Weak Attacks
Subjects: Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)

This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. The existence of adversarial examples in trained neural networks reflects the fact that expected risk alone does not capture the model's performance against worst-case inputs. We motivate the use of adversarial risk as an objective, although it cannot easily be computed exactly. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may be obscured to adversaries, by optimizing this surrogate rather than the true adversarial risk. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that our formulations and results will help researchers to develop more powerful defenses.

[43]  arXiv:1802.05693 (cross-list from cs.LG) [pdf, ps, other]
Title: Bandit Learning with Positive Externalities
Comments: 27 pages, 1 table
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Many platforms are characterized by the fact that future user arrivals are likely to have preferences similar to users who were satisfied in the past. In other words, arrivals exhibit positive externalities. We study multiarmed bandit (MAB) problems with positive externalities. Our model has a finite number of arms and users are distinguished by the arm(s) they prefer. We model positive externalities by assuming that the preferred arms of future arrivals are self-reinforcing based on the experiences of past users. We show that classical algorithms such as UCB, which are optimal in the classical MAB setting, may even exhibit linear regret in the context of positive externalities. We provide an algorithm which achieves optimal regret and show that such optimal regret exhibits substantially different structure from that observed in the standard MAB setting.

[44]  arXiv:1802.05694 (cross-list from cs.CL) [pdf, other]
Title: Multinomial Adversarial Networks for Multi-Domain Text Classification
Comments: NAACL 2018
Subjects: Computation and Language (cs.CL); Learning (cs.LG); Machine Learning (stat.ML)

Many text classification tasks are known to be highly domain-dependent. Unfortunately, the availability of training data can vary drastically across domains. Worse still, for some domains there may not be any annotated data at all. In this work, we propose a multinomial adversarial network (MAN) to tackle the text classification problem in this real-world multidomain setting (MDTC). We provide theoretical justifications for the MAN framework, proving that different instances of MANs are essentially minimizers of various f-divergence metrics (Ali and Silvey, 1966) among multiple probability distributions. MANs are thus a theoretically sound generalization of traditional adversarial networks that discriminate over two distributions. More specifically, for the MDTC task, MAN learns features that are invariant across multiple domains by resorting to its ability to reduce the divergence among the feature distributions of each domain. We present experimental results showing that MANs significantly outperform the prior art on the MDTC task. We also show that MANs achieve state-of-the-art performance for domains with no labeled data.

[45]  arXiv:1802.05695 (cross-list from cs.CL) [pdf, other]
Title: Explainable Prediction of Medical Codes from Clinical Text
Comments: NAACL 2018
Subjects: Computation and Language (cs.CL); Learning (cs.LG); Machine Learning (stat.ML)

Clinical notes are text documents that are created by clinicians for each patient encounter. They are typically accompanied by medical codes, which describe the diagnosis and treatment. Annotating these codes is labor intensive and error prone; furthermore, the connection between the codes and the text is not annotated, obscuring the reasons and details behind specific diagnoses and treatments. We present an attentional convolutional network that predicts medical codes from clinical text. Our method aggregates information across the document using a convolutional neural network, and uses an attention mechanism to select the most relevant segments for each of the thousands of possible codes. The method is accurate, achieving precision @ 8 of 0.7 and a Micro-F1 of 0.52, which are both significantly better than the prior state of the art. Furthermore, through an interpretability evaluation by a physician, we show that the attention mechanism identifies meaningful explanations for each code assignment.

Replacements for Fri, 16 Feb 18

[46]  arXiv:1207.5895 (replaced) [pdf, ps, other]
Title: Social learning equilibria
Subjects: Statistics Theory (math.ST)
[47]  arXiv:1503.05436 (replaced) [pdf, other]
Title: Inference in Additively Separable Models With a High-Dimensional Set of Conditioning Variables
Authors: Damian Kozbur
Subjects: Statistics Theory (math.ST)
[48]  arXiv:1604.04706 (replaced) [pdf, other]
Title: DS-MLR: Exploiting Double Separability for Scaling up Distributed Multinomial Logistic Regression
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
[49]  arXiv:1605.09232 (replaced) [pdf, ps, other]
Title: Tradeoffs between Convergence Speed and Reconstruction Accuracy in Inverse Problems
Comments: To appear in IEEE Transactions on Signal Processing
Subjects: Numerical Analysis (cs.NA); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
[50]  arXiv:1702.05008 (replaced) [pdf, other]
Title: Tree Ensembles with Rule Structured Horseshoe Regularization
Comments: 24 pages. R package
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
[51]  arXiv:1703.03165 (replaced) [pdf, other]
Title: Perturbation Bootstrap in Adaptive Lasso
Comments: 43 pages, 3 tables, 2 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
[52]  arXiv:1704.06176 (replaced) [pdf, other]
Title: Segmentation of the Proximal Femur from MR Images using Deep Convolutional Neural Networks
Comments: 26 pages, 5 figures, and 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG); Machine Learning (stat.ML)
[53]  arXiv:1704.07949 (replaced) [pdf, other]
Title: Reconditioning your quantile function
Authors: Keith Pedersen
Comments: 11 pages, 3 figures, 2 algorithms
Subjects: Computation (stat.CO); Data Analysis, Statistics and Probability (physics.data-an)
[54]  arXiv:1705.06073 (replaced) [pdf, other]
Title: Superfast Line Spectral Estimation
Comments: 16 pages, 7 figures, accepted for IEEE Transactions on Signal Processing
Subjects: Information Theory (cs.IT); Applications (stat.AP)
[55]  arXiv:1705.08415 (replaced) [pdf, other]
Title: Community Detection with Graph Neural Networks
Authors: Joan Bruna, Xiang Li
Subjects: Machine Learning (stat.ML)
[56]  arXiv:1706.03471 (replaced) [pdf, other]
Title: YellowFin and the Art of Momentum Tuning
Comments: Updated to reflect improved stability discussion and work for SysML presentation
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI)
[57]  arXiv:1706.06066 (replaced) [pdf, other]
Title: On Quadratic Convergence of DC Proximal Newton Algorithm for Nonconvex Sparse Learning in High Dimensions
Comments: 36 pages, 5 figures, 1 table, Accepted at NIPS 2017
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Optimization and Control (math.OC)
[58]  arXiv:1706.10295 (replaced) [pdf, other]
Title: Noisy Networks for Exploration
Comments: ICLR 2018
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
[59]  arXiv:1707.07113 (replaced) [pdf, other]
Title: Adversarial Variational Optimization of Non-Differentiable Simulators
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
[60]  arXiv:1708.00829 (replaced) [pdf, ps, other]
Title: Complexity Results for MCMC derived from Quantitative Bounds
Subjects: Computation (stat.CO); Probability (math.PR)
[61]  arXiv:1709.06360 (replaced) [pdf, ps, other]
Title: Minimax lower bounds for function estimation on graphs
Subjects: Statistics Theory (math.ST)
[62]  arXiv:1709.06853 (replaced) [pdf, other]
Title: Bandits with Delayed, Aggregated Anonymous Feedback
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
[63]  arXiv:1709.10433 (replaced) [pdf, other]
Title: On the Capacity of Face Representation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[64]  arXiv:1710.06451 (replaced) [pdf, other]
Title: A Bayesian Perspective on Generalization and Stochastic Gradient Descent
Comments: 13 pages, 9 figures. Published as a conference paper at ICLR 2018
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[65]  arXiv:1711.05360 (replaced) [pdf, other]
Title: The Dispersion Bias
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)
[66]  arXiv:1712.01193 (replaced) [pdf, other]
Title: A dual framework for trace norm regularized low-rank tensor completion
Comments: Title changed from earlier version, a shorter version appeared in the NIPS workshop on Synergies in Geometric Data Analysis 2017
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
[67]  arXiv:1802.03653 (replaced) [pdf, ps, other]
Title: On Symplectic Optimization
Comments: 20 pages, 5 figures
Subjects: Computation (stat.CO)
[68]  arXiv:1802.04784 (replaced) [pdf, ps, other]
Title: MONK -- Outlier-Robust Mean Embedding Estimation by Median-of-Means
Comments: 11 pages
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Functional Analysis (math.FA); Statistics Theory (math.ST)
[69]  arXiv:1802.04826 (replaced) [pdf, other]
Title: Leveraging the Exact Likelihood of Deep Latent Variables Models
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Methodology (stat.ME)
[70]  arXiv:1802.04956 (replaced) [pdf, ps, other]
Title: D2KE: From Distance to Kernel and Embedding
Comments: 18 pages, 4 tables
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
[71]  arXiv:1802.05141 (replaced) [pdf, other]
Title: Deep Learning and Data Assimilation for Real-Time Production Prediction in Natural Gas Wells
Comments: Reduced length preprint submitted to IJCAI 2018 for review
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn); Geophysics (physics.geo-ph); Machine Learning (stat.ML)
[72]  arXiv:1802.05155 (replaced) [pdf, other]
Title: Toward Deeper Understanding of Nonconvex Stochastic Optimization with Momentum using Diffusion Approximations
Subjects: Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)