Statistics
New submissions
New submissions for Fri, 16 Feb 18
 [1] arXiv:1802.05292 [pdf, ps, other]

Title: Flexible and objective time series analysis: a loss-based approach with two-piece location-scale distributions
Comments: 26 pages, 6 Figures
Subjects: Methodology (stat.ME); Other Statistics (stat.OT)
Two-piece location-scale models are used for modeling data presenting departures from symmetry. In this paper, we propose an objective Bayesian methodology for the tail parameter of two particular distributions of the above family: the skewed exponential power distribution and the skewed generalised logistic distribution. We apply the proposed objective approach to time series models and linear regression models where the error terms follow the distributions under study. The performance of the proposed approach is illustrated through simulation experiments and real data analysis. The methodology yields improvements in density forecasts, as shown by the analysis we carry out on the electricity prices in Nordpool markets.
 [2] arXiv:1802.05342 [pdf, other]

Title: Spatial Coherence of Oriented White Matter Microstructure: Applications to White Matter Regions Associated with Genetic Similarity
Journal-ref: NeuroImage (2018)
Subjects: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
We present a method to discover differences between populations with respect to the spatial coherence of their oriented white matter microstructure in arbitrarily shaped white matter regions. This method is applied to diffusion MRI scans of a subset of the Human Connectome Project dataset: 57 pairs of monozygotic and 52 pairs of dizygotic twins. After controlling for morphological similarity between twins, we identify 3.7% of all white matter as being associated with genetic similarity (35.1k voxels, $p < 10^{-4}$, false discovery rate 1.5%), 75% of which spatially clusters into twenty-two contiguous white matter regions. Furthermore, we show that the orientation similarity within these regions generalizes to a subset of 47 pairs of non-twin siblings, and show that these siblings are on average as similar as dizygotic twins. The regions are located in deep white matter including the superior longitudinal fasciculus, the optic radiations, the middle cerebellar peduncle, the corticospinal tract, and within the anterior temporal lobe, as well as the cerebellum, brain stem, and amygdalae.
These results extend previous work using undirected fractional anisotropy for measuring putative heritable influences in white matter. Our multidirectional extension better accounts for crossing fiber connections within voxels. This bottom-up approach has at its basis a novel measurement of coherence within neighboring voxel dyads between subjects, and avoids some of the fundamental ambiguities encountered with tractographic approaches to white matter analysis that estimate global connectivity.
 [3] arXiv:1802.05355 [pdf, other]

Title: The Role of Information Complexity and Randomization in Representation Learning
Comments: 35 pages, 3 figures. Submitted for publication
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
A grand challenge in representation learning is to learn the different explanatory factors of variation behind high-dimensional data. Encoder models are often determined to optimize performance on training data when the real objective is to generalize well to unseen data. Although there is enough numerical evidence suggesting that noise injection (during training) at the representation level might improve the generalization ability of encoders, an information-theoretic understanding of this principle remains elusive. This paper presents a sample-dependent bound on the generalization gap of the cross-entropy loss that scales with the information complexity (IC) of the representations, meaning the mutual information between inputs and their representations. The IC is empirically investigated for standard multi-layer neural networks with SGD on MNIST and CIFAR-10 datasets; the behaviour of the gap and the IC appear to be in direct correlation, suggesting that SGD selects encoders to implicitly minimize the IC. We specialize the IC to study the role of Dropout on the generalization capacity of deep encoders which is shown to be directly related to the encoder capacity, being a measure of the distinguishability among samples from their representations. Our results support some recent regularization methods.
 [4] arXiv:1802.05370 [pdf, other]

Title: Covariance Function Pre-Training with m-Kernels for Accelerated Bayesian Optimisation
Authors: Alistair Shilton, Sunil Gupta, Santu Rana, Pratibha Vellanki, Cheng Li, Laurence Park, Svetha Venkatesh, Alessandra Sutti, David Rubin, Thomas Dorin, Alireza Vahid, Murray Height
Subjects: Machine Learning (stat.ML)
The paper presents a novel approach to direct covariance function learning for Bayesian optimisation, with particular emphasis on experimental design problems where an existing corpus of condensed knowledge is present. The method presented borrows techniques from reproducing kernel Banach space theory (specifically m-kernels) and leverages them to convert (or re-weight) existing covariance functions into new, problem-specific covariance functions. The key advantage of this approach is that rather than relying on the user to manually select (with some hyperparameter tuning and experimentation) an appropriate covariance function, it constructs the covariance function to specifically match the problem at hand. The technique is demonstrated on two real-world problems (alloy design and carbon-fibre manufacturing) as well as a selected test function.
 [5] arXiv:1802.05400 [pdf]

Title: High Dimensional Bayesian Optimization Using Dropout
Comments: 7 pages; Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence 2017
Subjects: Machine Learning (stat.ML)
Scaling Bayesian optimization to high dimensions is a challenging task, as the global optimization of a high-dimensional acquisition function can be expensive and often infeasible. Existing methods depend either on limited active variables or the additive form of the objective function. We propose a new method for high-dimensional Bayesian optimization that uses a dropout strategy to optimize only a subset of variables at each iteration. We derive theoretical bounds for the regret and show how it can inform the derivation of our algorithm. We demonstrate the efficacy of our algorithms for optimization on two benchmark functions and two real-world applications: training cascade classifiers and optimizing alloy composition.
 [6] arXiv:1802.05431 [pdf, other]

Title: On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo
Comments: 37 pages; 4 figures
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
We provide convergence guarantees in Wasserstein distance for a variety of variance-reduction methods: SAGA Langevin diffusion, SVRG Langevin diffusion and control-variate underdamped Langevin diffusion. We analyze these methods under a uniform set of assumptions on the log-posterior distribution, assuming it to be smooth, strongly convex and Hessian Lipschitz. This is achieved by a new proof technique combining ideas from finite-sum optimization and the analysis of sampling methods. Our sharp theoretical bounds allow us to identify regimes of interest where each method performs better than the others. Our theory is verified with experiments on real-world and synthetic datasets.
 [7] arXiv:1802.05444 [pdf, other]

Title: A Weighted Likelihood Approach Based on Statistical Data Depths
Authors: Claudio Agostinelli
Subjects: Methodology (stat.ME)
We propose a general approach to construct weighted likelihood estimating equations with the aim of obtaining robust estimates. The weight, attached to each score contribution, is evaluated by comparing the statistical data depth at the model with that of the sample at a given point. Observations are considered regular when the ratio of these two depths is close to one, whereas when the ratio is large the corresponding score contribution may be downweighted. Details and examples are provided for the robust estimation of the parameters in the multivariate normal model. Because of the form of the weights, we expect that there will be no downweighting under the true model, leading to highly efficient estimators. Robustness is illustrated using two real data sets.
 [8] arXiv:1802.05447 [pdf, other]

Title: History PCA: A New Algorithm for Streaming PCA
Subjects: Machine Learning (stat.ML)
In this paper we propose a new algorithm for streaming principal component analysis. With limited memory, small devices cannot store all the samples in the high-dimensional regime. Streaming principal component analysis aims to find the $k$-dimensional subspace which can explain the most variation of the $d$-dimensional data points that come into memory sequentially. In order to deal with large $d$ and large $N$ (number of samples), most streaming PCA algorithms update the current model using only the incoming sample and then dump the information right away to save memory. However, the information contained in previously streamed data could be useful. Motivated by this idea, we develop a new streaming PCA algorithm called History PCA that achieves this goal. By using $O(Bd)$ memory with $B\approx 10$ being the block size, our algorithm converges much faster than existing streaming PCA algorithms. By changing the number of inner iterations, the memory usage can be further reduced to $O(d)$ while maintaining a comparable convergence speed. We provide theoretical guarantees for the convergence of our algorithm along with the rate of convergence. We also demonstrate on synthetic and real-world data sets that our algorithm compares favorably with other state-of-the-art streaming PCA methods in terms of the convergence speed and performance.
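For orientation, the kind of baseline the abstract contrasts itself with (update the model from the incoming sample only, then discard it) can be sketched with Oja's rule for a single component. This is not History PCA, whose details the abstract does not give; the function names, the learning rate, and the toy data are assumptions made purely for illustration.

```python
import math
import random


def oja_top_component(stream, dim, learning_rate=0.01):
    """Estimate the leading principal direction from a stream using O(d) memory."""
    # Deterministic, non-degenerate starting direction of unit length.
    w = [1.0 / math.sqrt(dim)] * dim
    for x in stream:
        # Projection of the incoming sample onto the current estimate.
        proj = sum(wi * xi for wi, xi in zip(w, x))
        # Oja update: move w towards x in proportion to the projection.
        w = [wi + learning_rate * proj * xi for wi, xi in zip(w, x)]
        # Renormalise; the sample itself is never stored.
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]
    return w


def toy_stream(n, seed=0):
    """Hypothetical 2-D data: standard deviation 3 along axis 0, 0.5 along axis 1."""
    rng = random.Random(seed)
    for _ in range(n):
        yield (3.0 * rng.gauss(0.0, 1.0), 0.5 * rng.gauss(0.0, 1.0))


w = oja_top_component(toy_stream(5000), dim=2)
print(w)  # expected to point (up to sign) mostly along axis 0
```

Each sample is used once and then dropped, which is the memory-saving behaviour the abstract describes; a block-based method would instead keep a small buffer of recent samples.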
 [9] arXiv:1802.05451 [pdf, other]

Title: Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
Structured prediction is concerned with predicting multiple inter-dependent labels simultaneously. Classical methods like CRF achieve this by maximizing a score function over the set of possible label assignments. Recent extensions use neural networks either to implement the score function or to perform the maximization. The current paper takes an alternative approach, using a neural network to generate the structured output directly, without going through a score function. We take an axiomatic perspective to derive the desired properties and invariances of such a network to certain input permutations, presenting a structural characterization that is provably both necessary and sufficient. We then discuss graph-permutation invariant (GPI) architectures that satisfy this characterization and explain how they can be used for deep structured prediction. We evaluate our approach on the challenging problem of inferring a scene graph from an image, namely, predicting entities and their relations in the image. We obtain state-of-the-art results on the challenging Visual Genome benchmark, outperforming all recent approaches.
 [10] arXiv:1802.05475 [pdf, ps, other]

Title: Robust and sparse Gaussian graphical modeling under cellwise contamination
Subjects: Methodology (stat.ME)
Graphical modeling explores dependences among a collection of variables by inferring a graph that encodes pairwise conditional independences. For jointly Gaussian variables, this translates into detecting the support of the precision matrix. Many modern applications feature high-dimensional and contaminated data that complicate this task. In particular, traditional robust methods that down-weight entire observation vectors are often inappropriate as high-dimensional data may feature partial contamination in many observations. We tackle this problem by giving a robust method for sparse precision matrix estimation based on the $\gamma$-divergence under a cellwise contamination model. Simulation studies demonstrate that our procedure outperforms existing methods especially for highly contaminated data.
 [11] arXiv:1802.05495 [pdf, other]

Title: An Operational (Preasymptotic) Measure of Fat-tailedness
Authors: Nassim Nicholas Taleb
Subjects: Methodology (stat.ME); Statistical Finance (q-fin.ST)
This note presents an operational measure of fat-tailedness for univariate probability distributions, in $[0,1]$ where 0 is maximally thin-tailed (Gaussian) and 1 is maximally fat-tailed.
Among others, 1) it helps assess the sample size needed to establish a comparative $n$ needed for statistical significance, 2) allows practical comparisons across classes of fat-tailed distributions, 3) helps understand some inconsistent attributes of the lognormal, depending on the parametrization of its scale parameter.
The literature is rich for what concerns asymptotic behavior, but there is a large void for finite values of $n$, those needed for operational purposes. Conventional measures of fat-tailedness, namely 1) the tail index for the power law class, and 2) Kurtosis for finite moment distributions, fail to apply to some distributions, and do not allow comparisons across classes and parametrization, that is between power laws outside the Levy-Stable basin, or power laws to distributions in other classes, or power laws for different number of summands. How can one compare a sum of 100 Student T distributed random variables with 3 degrees of freedom to one in a Levy-Stable or a Lognormal class? How can one compare a sum of 100 Student T with 3 degrees of freedom to a single Student T with 2 degrees of freedom?
We propose an operational and heuristic measure that allows us to compare $n$-summed independent variables under all distributions with finite first moment. The method is based on the rate of convergence of the Law of Large Numbers for finite sums, $n$-summands specifically.
We get either explicit expressions or simulation results and bounds for the lognormal, exponential, Pareto, and the Student T distributions in their various calibrations in addition to the general Pearson classes.
 [12] arXiv:1802.05530 [pdf, other]

Title: Modelling spatial heterogeneity and discontinuities using Voronoi tessellations
Authors: Christopher A. Pope, John Paul Gosling, Stuart Barber, Jill Johnson, Takanobu Yamaguchi, Graham Feingold, Paul Blackwell
Subjects: Methodology (stat.ME)
Many methods for modelling spatial processes assume global smoothness properties; such assumptions are often violated in practice. We introduce a method for modelling spatial processes that display heterogeneity or contain discontinuities. The problem of non-stationarity is dealt with by using a combination of Voronoi tessellation to partition the input space, and a separate Gaussian process to model the data on each region of the partitioned space. Our method is highly flexible because we allow the Voronoi cells to form relationships with each other, which can enable non-convex and disconnected regions to be considered. In such problems, identifying the borders between regions is often of great importance and we propose an adaptive sampling method to gain extra information along such borders. The method is illustrated with simulation studies and application to real data.
 [13] arXiv:1802.05550 [pdf, other]

Title: ICA based on Split Generalized Gaussian
Comments: arXiv admin note: substantial text overlap with arXiv:1701.09160
Subjects: Machine Learning (stat.ML)
Independent Component Analysis (ICA), one of the basic tools in data analysis, aims to find a coordinate system in which the components of the data are independent. Most popular ICA methods, such as FastICA and JADE, use kurtosis as a metric of non-Gaussianity to maximize. However, their assumption of fourth-order moment (kurtosis) may not always be satisfied in practice. One possible solution is to use third-order moment (skewness) instead of kurtosis, which was applied in $ICA_{SG}$ and EcoICA.
In this paper we present a competitive approach to ICA based on the Split Generalized Gaussian distribution (SGGD), which is well adapted to heavy-tailed as well as asymmetric data. Consequently, we obtain a method which works better than the classical approaches in both cases: heavy tails and non-symmetric data.
 [14] arXiv:1802.05570 [pdf, other]

Title: Optimal Transport: Fast Probabilistic Approximation with Exact Solvers
Subjects: Computation (stat.CO); Methodology (stat.ME)
We propose a simple subsampling scheme for fast randomized approximate computation of optimal transport distances. This scheme operates on a random subset of the full data and can use any exact algorithm as a black-box back-end, including state-of-the-art solvers and entropically penalized versions. It is based on averaging the exact distances between empirical measures generated from independent samples from the original measures and can easily be tuned towards higher accuracy or shorter computation times. To this end, we give non-asymptotic deviation bounds for its accuracy. In particular, we show that in many important cases, including images, the approximation error is independent of the size of the full problem. We present numerical experiments that demonstrate that a very good approximation in typical applications can be obtained in a computation time that is several orders of magnitude smaller than what is required for exact computation of the full problem.
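The averaging idea described in this abstract can be illustrated in one dimension, where an exact solver is trivial: for two equal-size, uniformly weighted samples on the line, the 1-Wasserstein distance is the mean absolute difference of the sorted values. The sketch below is a toy under those assumptions; the function names, sample sizes, and the shifted-grid data are illustrative choices, not taken from the paper.

```python
import random


def exact_w1_1d(xs, ys):
    """Exact 1-Wasserstein distance between equal-size, uniformly weighted 1-D samples."""
    assert len(xs) == len(ys)
    # In one dimension the optimal coupling matches order statistics.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def subsampled_ot(xs, ys, sample_size, repeats, solver=exact_w1_1d, seed=0):
    """Average the exact distance over independent random subsamples of both measures."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        # Draw i.i.d. points from each (uniform) empirical measure.
        sub_x = rng.choices(xs, k=sample_size)
        sub_y = rng.choices(ys, k=sample_size)
        # The solver is treated as a black box on the small problem.
        total += solver(sub_x, sub_y)
    return total / repeats


# Toy measures: a uniform grid on [0, 1) and the same grid shifted by 2.
xs = [i / 1000 for i in range(1000)]
ys = [x + 2.0 for x in xs]

full = exact_w1_1d(xs, ys)
approx = subsampled_ot(xs, ys, sample_size=100, repeats=20)
print(full, approx)
```

Increasing `sample_size` or `repeats` trades computation time for accuracy, which mirrors the tuning knob mentioned in the abstract; the `solver` argument could be swapped for any exact routine in higher dimensions.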
 [15] arXiv:1802.05584 [pdf, other]

Title: Convolutional Analysis Operator Learning: Acceleration, Convergence, Application, and Neural Networks
Comments: 19 pages, 9 figures
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Convolutional operator learning is increasingly gaining attention in many signal processing and computer vision applications. Learning kernels has mostly relied on so-called local approaches that extract and store many overlapping patches across training signals. Due to memory demands, local approaches have limitations when learning kernels from large datasets (particularly with multi-layered structures, e.g., convolutional neural network (CNN)) and/or applying the learned kernels to high-dimensional signal recovery problems. The so-called global approach has been studied within the "synthesis" signal model, e.g., convolutional dictionary learning, overcoming the memory problems by careful algorithmic designs. This paper proposes a new convolutional analysis operator learning (CAOL) framework in the global approach, and develops a new convergent Block Proximal Gradient method using a Majorizer (BPG-M) to solve the corresponding block multi-nonconvex problems. To learn diverse filters within the CAOL framework, this paper introduces an orthogonality constraint that enforces a tight-frame (TF) filter condition, and a regularizer that promotes diversity between filters. Numerical experiments show that, for tight majorizers, BPG-M significantly accelerates the CAOL convergence rate compared to the state-of-the-art method, BPG. Numerical experiments for sparse-view computational tomography show that CAOL using TF filters significantly improves reconstruction quality compared to a conventional edge-preserving regularizer. Finally, this paper shows that CAOL can be useful to mathematically model a CNN, and the corresponding updates obtained via BPG-M coincide with core modules of the CNN.
 [16] arXiv:1802.05622 [pdf, other]

Title: Conditioning of three-dimensional generative adversarial networks for pore and reservoir-scale models
Comments: 5 pages, 2 figures
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)
Geostatistical modeling of petrophysical properties is a key step in modern integrated oil and gas reservoir studies. Recently, generative adversarial networks (GAN) have been shown to be a successful method for generating unconditional simulations of pore and reservoir-scale models. This contribution leverages the differentiable nature of neural networks to extend GANs to the conditional simulation of three-dimensional pore and reservoir-scale models. Based on the previous work of Yeh et al. (2016), we use a content loss to constrain to the conditioning data and a perceptual loss obtained from the evaluation of the GAN discriminator network. The technique is tested on the generation of three-dimensional micro-CT images of a Ketton limestone constrained by two-dimensional cross-sections, and on the simulation of the Maules Creek alluvial aquifer constrained by one-dimensional sections. Our results show that GANs represent a powerful method for sampling conditioned pore and reservoir samples for stochastic reservoir evaluation workflows.
 [17] arXiv:1802.05631 [pdf, other]

Title: Direct Estimation of Differences in Causal Graphs
Subjects: Methodology (stat.ME)
We consider the problem of estimating the differences between two causal directed acyclic graph (DAG) models given i.i.d. samples from each model. This is of interest for example in genomics, where large-scale gene expression data is becoming available under different cellular contexts, for different cell types, or disease states. Changes in the structure or edge weights of the underlying causal graphs reflect alterations in the gene regulatory networks and provide important insights into the emergence of a particular phenotype. While the individual networks are usually very large, containing high-degree hub nodes and thus difficult to learn, the overall change between two related networks can be sparse. We here provide the first provably consistent method for directly estimating the differences in a pair of causal DAGs without separately learning two possibly large and dense DAG models and computing their difference. Our two-step algorithm first uses invariance tests between regression coefficients of the two data sets to estimate the skeleton of the difference graph and then orients some of the edges using invariance tests between regression residual variances. We demonstrate the properties of our method through a simulation study and apply it to the analysis of gene expression data from ovarian cancer and during T-cell activation.
 [18] arXiv:1802.05635 [pdf, ps, other]

Title: Nonparametric Bayesian posterior contraction rates for scalar diffusions with high-frequency data
Authors: Kweku Abraham
Subjects: Statistics Theory (math.ST)
We consider inference in the scalar diffusion model $dX_t=b(X_t)dt+\sigma(X_t)dW_t$ with discrete data $(X_{j\Delta_n})_{0\leq j \leq n}$, $n\to \infty,~\Delta_n\to 0$ and periodic coefficients. For $\sigma$ given, we prove a general theorem detailing conditions under which Bayesian posteriors will contract in $L^2$-distance around the true drift function $b_0$ at the frequentist minimax rate (up to logarithmic factors) over Besov smoothness classes. We exhibit natural nonparametric priors which satisfy our conditions. Our results show that the Bayesian method adapts both to an unknown sampling regime and to unknown smoothness.
 [19] arXiv:1802.05650 [pdf, other]

Title: Ranks and Pseudo-Ranks - Paradoxical Results of Rank Tests
Comments: 19 pages, 0 figures
Subjects: Statistics Theory (math.ST)
Rank-based inference methods are applied in various disciplines, typically when procedures relying on standard normal theory are not justifiable, for example when data are not symmetrically distributed, contain outliers, or responses are even measured on ordinal scales. Various specific rank-based methods have been developed for two and more samples, and also for general factorial designs (e.g., Kruskal-Wallis test, Jonckheere-Terpstra test). It is the aim of the present paper (1) to demonstrate that traditional rank procedures for several samples or general factorial designs may lead to paradoxical results in case of unbalanced samples, (2) to explain why this is the case, and (3) to provide a way to overcome these disadvantages of traditional rank-based inference. Theoretical investigations show that the paradoxical results can be explained by carefully considering the non-centralities of the test statistics which may be non-zero for the traditional tests in unbalanced designs. These non-centralities may even become arbitrarily large for increasing sample sizes in the unbalanced case. A simple solution is the use of so-called pseudo-ranks instead of ranks. As a special case, we illustrate the effects in subgroup analyses which are often used when dealing with rare diseases.
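The abstract does not define pseudo-ranks. Under the definition commonly used in the nonparametric literature (assumed here, not quoted from the paper), a rank evaluates the sample-size-weighted mean of the group empirical distribution functions, whereas a pseudo-rank evaluates their unweighted mean, so it does not depend on how unbalanced the groups are. A minimal sketch with hypothetical function names:

```python
def _count(u):
    """Normalised counting function: 0 if u < 0, 1/2 if u == 0, 1 if u > 0."""
    return 0.0 if u < 0 else (0.5 if u == 0 else 1.0)


def _ecdf(group, x):
    """Normalised empirical distribution function of one group, evaluated at x."""
    return sum(_count(x - v) for v in group) / len(group)


def ranks_and_pseudo_ranks(groups):
    """Return (ranks, pseudo_ranks), each a list of lists aligned with `groups`."""
    n_total = sum(len(g) for g in groups)
    ranks, pseudo = [], []
    for g in groups:
        r_g, p_g = [], []
        for x in g:
            f = [_ecdf(h, x) for h in groups]
            # Ranks: group ECDFs weighted by relative sample size.
            weighted = sum(len(h) * fh for h, fh in zip(groups, f)) / n_total
            # Pseudo-ranks: plain average of the group ECDFs.
            unweighted = sum(f) / len(groups)
            r_g.append(0.5 + n_total * weighted)
            p_g.append(0.5 + n_total * unweighted)
        ranks.append(r_g)
        pseudo.append(p_g)
    return ranks, pseudo


balanced = ranks_and_pseudo_ranks([[1, 3], [2, 4]])
unbalanced = ranks_and_pseudo_ranks([[1], [2, 3, 4]])
print(balanced)
print(unbalanced)
```

In the balanced toy design the two quantities coincide; in the unbalanced one they differ, which is the situation the abstract is concerned with.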
 [20] arXiv:1802.05664 [pdf, other]

Title: DeepMatch: Balancing Deep Covariate Representations for Causal Inference Using Adversarial Training
Authors: Nathan Kallus
Subjects: Machine Learning (stat.ML)
We study optimal covariate balance for causal inferences from observational data when rich covariates and complex relationships necessitate flexible modeling with neural networks. Standard approaches such as propensity weighting and matching/balancing fail in such settings due to miscalibrated propensity nets and inappropriate covariate representations, respectively. We propose a new method based on adversarial training of a weighting and a discriminator network that effectively addresses this methodological gap. This is demonstrated through new theoretical characterizations of the method as well as empirical results using both fully connected architectures to learn complex relationships and convolutional architectures to handle image confounders, showing how this new method can enable strong causal analyses in these challenging settings.
 [21] arXiv:1802.05680 [pdf, other]

Title: Constraining the Dynamics of Deep Probabilistic Models
Comments: 12 pages
Subjects: Machine Learning (stat.ML)
We introduce a novel generative formulation of deep probabilistic models implementing "soft" constraints on the dynamics of the functions they can model. In particular we develop a flexible methodological framework where the modeled functions and derivatives of a given order are subject to inequality or equality constraints. We characterize the posterior distribution over model and constraint parameters through stochastic variational inference techniques. As a result, the proposed approach allows for accurate and scalable uncertainty quantification of predictions and parameters. We demonstrate the application of equality constraints in the challenging problem of parameter inference in ordinary differential equation models, while we showcase the application of inequality constraints on monotonic regression on count data. The proposed approach is extensively tested in several experimental settings, leading to highly competitive results in challenging modeling applications, while offering high expressiveness, flexibility and scalability.
 [22] arXiv:1802.05688 [pdf, other]

Title: Simulation assisted machine learning
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Quantitative Methods (q-bio.QM)
Predicting how a proposed cancer treatment will affect a given tumor can be cast as a machine learning problem, but the complexity of biological systems, the number of potentially relevant genomic and clinical features, and the lack of very large scale patient data repositories make this a unique challenge. "Pure data" approaches to this problem are underpowered to detect combinatorially complex interactions and are bound to uncover false correlations despite statistical precautions taken (1). To investigate this setting, we propose a method to integrate simulations, a strong form of prior knowledge, into machine learning, a combination which to date has been largely unexplored. The results of multiple simulations (under various uncertainty scenarios) are used to compute similarity measures between every pair of samples: sample pairs are given a high similarity score if they behave similarly under a wide range of simulation parameters. These similarity values, rather than the original high dimensional feature data, are used to train kernelized machine learning algorithms such as support vector machines, thus handling the curse of dimensionality that typically affects genomic machine learning. Using four synthetic datasets of complex systems (three biological models and one network flow optimization model), we demonstrate that when the number of training samples is small compared to the number of features, the simulation kernel approach dominates over no-prior-knowledge methods. In addition to biology and medicine, this approach should be applicable to other disciplines, such as weather forecasting, financial markets, and agricultural management, where predictive models are sought and informative yet approximate simulations are available. The Python SimKern software, the models (in MATLAB, Octave, and R), and the datasets are made freely available at https://github.com/davidcraft/SimKern.
Cross-lists for Fri, 16 Feb 18
 [23] arXiv:1705.01166 (cross-list from physics.data-an) [pdf, other]

Title: Maximizing the information learned from finite data selects a simple model
Comments: 9 pages, 8 figures. v3 has improved discussion and adds an appendix about MDL and Bayes factors, and matches version to appear in PNAS (modulo comma placement). Title changed from "Rational Ignorance: Simpler Models Learn More Information from Finite Data"
Journal-ref: PNAS February 2018
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
We use the language of uninformative Bayesian prior choice to study the selection of appropriately simple effective models. We advocate for the prior which maximizes the mutual information between parameters and predictions, learning as much as possible from limited data. When many parameters are poorly constrained by the available data, we find that this prior puts weight only on boundaries of the parameter manifold. Thus it selects a lower-dimensional effective theory in a principled way, ignoring irrelevant parameter directions. In the limit where there is sufficient data to tightly constrain any number of parameters, this reduces to Jeffreys prior. But we argue that this limit is pathological when applied to the hyperribbon parameter manifolds generic in science, because it leads to dramatic dependence on effects invisible to experiment.
 [24] arXiv:1802.05312 (cross-list from cs.LG) [pdf, other]

Title: Learning Deep Disentangled Embeddings with the FStatistic LossSubjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Deepembedding methods aim to discover representations of a domain that make explicit the domain's class structure. Disentangling methods aim to make explicit compositional or factorial structure. We combine these two active but independent lines of research and propose a new paradigm for discovering disentangled representations of class structure; these representations reveal the underlying factors that jointly determine class. We propose and evaluate a novel loss function based on the $F$ statistic, which describes the separation of two or more distributions. By ensuring that distinct classes are well separated on a subset of embedding dimensions, we obtain embeddings that are useful for fewshot learning. By not requiring separation on all dimensions, we encourage the discovery of disentangled representations. Our embedding procedure matches or beats stateoftheart procedures on deep embeddings, as evaluated by performance on recall@$k$ and fewshot learning tasks. To evaluate alternative approaches on disentangling, we formalize two key properties of a disentangled representation: modularity and explicitness. By these criteria, our procedure yields disentangled representations, whereas traditional procedures fail. The goal of our work is to obtain more interpretable, manipulable, and generalizable deep representations of concepts and categories.
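As background for the $F$ statistic mentioned in this abstract, the classical one-way ANOVA form (between-class mean square divided by within-class mean square) can be computed for a single embedding dimension as below. This is only the textbook statistic; how the paper turns it into a loss is not stated in the abstract, and the function name and toy values are assumptions for illustration.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for values of one dimension, grouped by class."""
    k = len(groups)                      # number of classes
    n = sum(len(g) for g in groups)      # total number of values
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-class sum of squares, k - 1 degrees of freedom.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-class sum of squares, n - k degrees of freedom.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy one-dimensional embeddings for two classes.
separated = f_statistic([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
overlapping = f_statistic([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
print(separated, overlapping)
```

A larger value indicates that the class means are far apart relative to the spread within each class, which is the sense of "separation" the abstract refers to.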
 [25] arXiv:1802.05313 (cross-list from cs.AI) [pdf, other]

Title: Reinforcement Learning from Imperfect Demonstrations
Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG); Machine Learning (stat.ML)
Robust real-world learning should benefit from both demonstrations and interactions with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on the reward received from the environment. These tasks have divergent losses which are difficult to jointly optimize and such methods can be very sensitive to noisy demonstrations. We propose a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. NAC learns an initial policy network from demonstrations and refines the policy in the environment, surpassing the demonstrator's performance. Crucially, both learning from demonstration and interactive refinement use the same objective, unlike prior approaches that combine distinct supervised and reinforcement losses. This makes NAC robust to suboptimal demonstration data since the method is not forced to mimic all of the examples in the dataset. We show that our unified reinforcement learning algorithm can learn robustly and outperform existing baselines when evaluated on several realistic driving games.
 [26] arXiv:1802.05315 (cross-list from cs.LG) [pdf, other]

Title: Online Learning for Non-Stationary A/B Tests
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
The rollout of new versions of a feature in modern applications is a manual multi-stage process, as the feature is released to ever larger groups of users, while its performance is carefully monitored. This kind of A/B testing is ubiquitous, but suboptimal, as the monitoring requires heavy human intervention, is not guaranteed to capture consistent, but short-term fluctuations in performance, and is inefficient, as better versions take a long time to reach the full population.
In this work we formulate this question as that of expert learning, and give a new algorithm Follow-The-Best-Interval, FTBI, that works in dynamic, non-stationary environments. Our approach is practical, simple, and efficient, and has rigorous guarantees on its performance. Finally, we perform a thorough evaluation on synthetic and real-world datasets and show that our approach outperforms current state-of-the-art methods.
 [27] arXiv:1802.05319 (cross-list from cs.SE) [pdf, other]

Title: 500+ Times Faster Than Deep Learning (A Case Study Exploring Faster Methods for Text Mining StackOverflow)
Subjects: Software Engineering (cs.SE); Learning (cs.LG); Machine Learning (stat.ML)
Deep learning methods are useful for high-dimensional data and are becoming widely used in many areas of software engineering. Deep learners utilize extensive computational power and can take a long time to train, making it difficult to widely validate, repeat, and improve their results. Further, they are not the best solution in all domains. For example, recent results show that for finding related Stack Overflow posts, a tuned SVM performs similarly to a deep learner, but is significantly faster to train. This paper extends that recent result by clustering the dataset, then tuning learners within each cluster. This approach is over 500 times faster than deep learning (and over 900 times faster if we use all the cores on a standard laptop computer). Significantly, this faster approach generates classifiers nearly as good (within 2% F1 score) as the much slower deep learning method. Hence we recommend this faster method since it is much easier to reproduce and utilizes far fewer CPU resources. More generally, we recommend that, before researchers release research results, they compare their supposedly sophisticated methods against simpler alternatives (e.g., applying simpler learners to build local models).
 [28] arXiv:1802.05333 (cross-list from econ.EM) [pdf, ps, other]

Title: Bootstrap-Assisted Unit Root Testing With Piecewise Locally Stationary Errors
Comments: This paper has been accepted for publication and will appear in a revised form, subsequent to editorial input by Cambridge University Press, in Econometric Theory
Subjects: Econometrics (econ.EM); Statistics Theory (math.ST)
In unit root testing, a piecewise locally stationary process is adopted to accommodate nonstationary errors that can have both smooth and abrupt changes in second- or higher-order properties. Under this framework, the limiting null distributions of the conventional unit root test statistics are derived and shown to contain a number of unknown parameters. To circumvent the difficulty of direct consistent estimation, we propose to use the dependent wild bootstrap to approximate the nonpivotal limiting null distributions and provide a rigorous theoretical justification for bootstrap consistency. The proposed method is compared through finite sample simulations with the recolored wild bootstrap procedure, which was developed for errors that follow a heteroscedastic linear process. Further, a combination of autoregressive sieve recoloring with the dependent wild bootstrap is shown to perform well. The validity of the dependent wild bootstrap in a nonstationary setting is demonstrated for the first time, showing the possibility of extensions to other inference problems associated with locally stationary processes.
 [29] arXiv:1802.05335 (cross-list from cs.LG) [pdf, other]

Title: Multimodal Generative Models for Scalable Weakly-Supervised Learning
Comments: 8 pages, 10 figures
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
Multiple modalities often co-occur when describing natural phenomena. Learning a joint representation of these modalities should yield deeper and more useful representations. Previous works have proposed generative models to handle multimodal input. However, these models either do not learn a joint distribution or require complex additional computations to handle missing data. Here, we introduce a multimodal variational autoencoder that uses a product-of-experts inference network and a subsampled training paradigm to solve the multimodal inference problem. Notably, our model shares parameters to efficiently learn under any combination of missing modalities, thereby enabling weakly-supervised learning. We apply our method on four datasets and show that we match state-of-the-art performance using many fewer parameters. In each case our approach yields strong weakly-supervised results. We then consider a case study of learning image transformations (edge detection, colorization, facial landmark segmentation, etc.) as a set of modalities. We find appealing results across this range of tasks.
 [30] arXiv:1802.05339 (cross-list from physics.data-an) [pdf, other]

Title: Two- and Multi-dimensional Curve Fitting using Bayesian Inference
Authors: Andrew W. Steiner
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Instrumentation and Methods for Astrophysics (astro-ph.IM); Statistics Theory (math.ST)
Fitting models to data using Bayesian inference is quite common, but when each point in parameter space gives a curve, fitting the curve to a data set requires new nuisance parameters, which specify the metric embedding the one-dimensional curve into the higher-dimensional space occupied by the data. A generic formalism for curve fitting in the context of Bayesian inference is developed which shows how the aforementioned metric arises. The result is a natural generalization of previous works, and is compared to oft-used frequentist approaches and similar Bayesian techniques.
 [31] arXiv:1802.05351 (cross-list from cs.CR) [pdf, other]

Title: Stealing Hyperparameters in Machine Learning
Comments: To appear in the Proceedings of the IEEE Symposium on Security and Privacy, May 2018
Subjects: Cryptography and Security (cs.CR); Learning (cs.LG); Machine Learning (stat.ML)
Hyperparameters are critical in machine learning, as different hyperparameters often result in models with significantly different performance. Hyperparameters may be deemed confidential because of their commercial value and the confidentiality of the proprietary algorithms that the learner uses to learn them. In this work, we propose attacks on stealing the hyperparameters that are learned by a learner. We call our attacks hyperparameter stealing attacks. Our attacks are applicable to a variety of popular machine learning algorithms such as ridge regression, logistic regression, support vector machine, and neural network. We evaluate the effectiveness of our attacks both theoretically and empirically. For instance, we evaluate our attacks on Amazon Machine Learning. Our results demonstrate that our attacks can accurately steal hyperparameters. We also study countermeasures. Our results highlight the need for new defenses against our hyperparameter stealing attacks for certain machine learning algorithms.
 [32] arXiv:1802.05374 (cross-list from math.OC) [pdf, other]

Title: A Progressive Batching L-BFGS Method for Machine Learning
Authors: Raghu Bollapragada, Dheevatsa Mudigere, Jorge Nocedal, Hao-Jun Michael Shi, Ping Tak Peter Tang
Comments: 25 pages, 17 figures, 2 tables
Subjects: Optimization and Control (math.OC); Learning (cs.LG); Machine Learning (stat.ML)
The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise to faster algorithms with better generalization properties, L-BFGS is currently not considered an algorithm of choice for large-scale machine learning applications. One need not, however, choose between the two extremes represented by the full batch or highly stochastic regimes, and may instead follow a progressive batching approach in which the sample size increases during the course of the optimization. In this paper, we present a new version of the L-BFGS algorithm that combines three basic components (progressive batching, a stochastic line search, and stable quasi-Newton updating) and that performs well on training logistic regression and deep neural networks. We provide supporting convergence theory for the method.
 [33] arXiv:1802.05380 (cross-list from cs.LG) [pdf, other]

Title: Active Feature Acquisition with Supervised Matrix Completion
Comments: 9 pages, 8 figures
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
Missing features are a serious problem in many applications, which may lead to low quality of training data and further significantly degrade the learning performance. Because feature acquisition usually involves special devices or complex processes, it is expensive to acquire all feature values for the whole dataset. On the other hand, features may be correlated with each other, and some values may be recovered from the others. It is thus important to decide which features are most informative for recovering the other features as well as improving the learning performance. In this paper, we try to train an effective classification model with the least acquisition cost by jointly performing active feature querying and supervised matrix completion. When completing the feature matrix, a novel target function is proposed to simultaneously minimize the reconstruction error on observed entries and the supervised loss on training data. When querying the feature value, the most uncertain entry is actively selected based on the variance of previous iterations. In addition, a bi-objective optimization method is presented for cost-aware active selection when features bear different acquisition costs. The effectiveness of the proposed approach is well validated by both theoretical analysis and experimental study.
 [34] arXiv:1802.05386 (cross-list from cs.LG) [pdf]

Title: Shamap: Shape-based Manifold Learning
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
For manifold learning, it is assumed that high-dimensional sample/data points are on an embedded low-dimensional manifold. Usually, distances among samples are computed to represent the underlying data structure, for a specified distance measure such as the Euclidean distance or geodesic distance. For manifold learning, here we propose a metric according to the angular change along a geodesic line, thereby reflecting the underlying shape-oriented information or the similarity between high- and low-dimensional representations of a data cloud. Our numerical results are described to demonstrate the feasibility and merits of the proposed dimensionality reduction scheme.
 [35] arXiv:1802.05392 (cross-list from cs.LG) [pdf, other]

Title: Reducing over-clustering via the powered Chinese restaurant process
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
Dirichlet process mixture (DPM) models tend to produce many small clusters regardless of whether they are needed to accurately characterize the data; this is particularly true for large data sets. However, interpretability, parsimony, data storage and communication costs all are hampered by having overly many clusters. We propose a powered Chinese restaurant process to limit this kind of problem and penalize over-clustering. The method is illustrated using some simulation examples and data with large and small sample size including MNIST and the Old Faithful Geyser data.
 [36] arXiv:1802.05394 (cross-list from cs.LG) [pdf, other]

Title: Cost-Effective Training of Deep CNNs with Active Model Adaptation
Comments: 9 pages
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
Deep convolutional neural networks have achieved great success in various applications. However, training an effective DNN model for a specific task is rather challenging because it requires prior knowledge or experience to design the network architecture, a repeated trial-and-error process to tune the parameters, and a large set of labeled data to train the model. In this paper, we propose to overcome these challenges by actively adapting a pretrained model to a new task with fewer labeled examples. Specifically, the pretrained model is iteratively fine-tuned based on the most useful examples. The examples are actively selected based on a novel criterion, which jointly estimates the potential contribution of an instance on optimizing the feature representation as well as improving the classification model for the target task. On one hand, the pretrained model brings plentiful information from its original task, avoiding redesign of the network architecture or training from scratch; and on the other hand, the labeling cost can be significantly reduced by active label querying. Experiments on multiple datasets and different pretrained models demonstrate that the proposed approach can achieve cost-effective training of DNNs.
 [37] arXiv:1802.05408 (cross-list from cs.IT) [pdf, ps, other]

Title: "Dependency Bottleneck" in Autoencoding Architectures: an Empirical Study
Subjects: Information Theory (cs.IT); Learning (cs.LG); Machine Learning (stat.ML)
Recent works investigated the generalization properties in deep neural networks (DNNs) by studying the Information Bottleneck in DNNs. However, the measurement of the mutual information (MI) is often inaccurate due to the density estimation. To address this issue, we propose to measure the dependency instead of MI between layers in DNNs. Specifically, we propose to use the Hilbert-Schmidt Independence Criterion (HSIC) as the dependency measure, which can measure the dependence of two random variables without estimating probability densities. Moreover, HSIC is a special case of the Squared-loss Mutual Information (SMI). In the experiment, we empirically evaluate the generalization property using HSIC in both the reconstruction and prediction autoencoding (AE) architectures.
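To make the density-free nature of HSIC concrete, here is a minimal sketch of the standard biased empirical estimator, tr(KHLH)/(n-1)^2, for paired scalar samples. The Gaussian kernel, the bandwidth sigma, and all function names are assumptions chosen for illustration; the abstract does not say how the authors compute HSIC between layers.

```python
import math


def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian kernel over scalar samples (assumed kernel)."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in xs]
            for a in xs]


def center(gram):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - col_means[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between paired scalar samples xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must be paired and contain at least 2 samples")
    centered_x = center(gaussian_gram(xs, sigma))
    gram_y = gaussian_gram(ys, sigma)
    # tr(K H L H) equals tr((H K H) L) by cyclicity of the trace.
    trace = sum(centered_x[i][j] * gram_y[j][i]
                for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


xs = [0.0, 1.0, 2.0, 3.0]
print(hsic(xs, xs))                    # dependence of a variable on itself
print(hsic(xs, [5.0, 5.0, 5.0, 5.0]))  # a constant carries no dependence
```

Only kernel evaluations between samples are needed, which is why no probability density has to be estimated.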
 [38] arXiv:1802.05411 (cross-list from cs.LG) [pdf, ps, other]

Title: Selecting the Best in GANs Family: a Post Selection Inference Framework
Authors: Yao-Hung Hubert Tsai, Makoto Yamada, Denny Wu, Ruslan Salakhutdinov, Ichiro Takeuchi, Kenji Fukumizu
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
"Which Generative Adversarial Networks (GANs) generate the most plausible images?" has been a frequently asked question among researchers. To address this problem, we first propose an incomplete U-statistics estimate of maximum mean discrepancy $\mathrm{MMD}_{inc}$ to measure the distribution discrepancy between generated and real images. $\mathrm{MMD}_{inc}$ enjoys the advantages of asymptotic normality, computation efficiency, and model agnosticity. We then propose a GANs analysis framework to select and test the "best" member in GANs family using the Post Selection Inference (PSI) with $\mathrm{MMD}_{inc}$. In the experiments, we adopt the proposed framework on 7 GANs variants and compare their $\mathrm{MMD}_{inc}$ scores.
 [39] arXiv:1802.05429 (cross-list from cs.SD) [pdf, ps, other]

Title: Blind Source Separation with Optimal Transport Non-negative Matrix Factorization
Comments: 22 pages, 7 figures, 2 additional files
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
Optimal transport as a loss for machine learning optimization problems has recently gained a lot of attention. Building upon recent advances in computational optimal transport, we develop an optimal transport non-negative matrix factorization (NMF) algorithm for supervised speech blind source separation (BSS). Optimal transport allows us to design and leverage a cost between short-time Fourier transform (STFT) spectrogram frequencies, which takes into account how humans perceive sound. We give empirical evidence that using our proposed optimal transport NMF leads to perceptually better results than Euclidean NMF, for both isolated voice reconstruction and BSS tasks. Finally, we demonstrate how to use optimal transport for cross-domain sound processing tasks, where frequencies represented in the input spectrograms may be different from one spectrogram to another.
 [40] arXiv:1802.05472 (cross-list from cs.LG) [pdf]

Title: Admissible Time Series Motif Discovery with Missing Data
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The discovery of time series motifs has emerged as one of the most useful primitives in time series data mining. Researchers have shown its utility for exploratory data mining, summarization, visualization, segmentation, classification, clustering, and rule discovery. Although there has been more than a decade of extensive research, there is still no technique to allow the discovery of time series motifs in the presence of missing data, despite the well-documented ubiquity of missing data in scientific, industrial, and medical datasets. In this work, we introduce a technique for motif discovery in the presence of missing data. We formally prove that our method is admissible, producing no false negatives. We also show that our method can piggyback off the fastest known motif discovery method with a small constant factor time/space overhead. We will demonstrate our approach on diverse datasets with varying amounts of missing data.
 [41] arXiv:1802.05637 (cross-list from cs.LG) [pdf, other]

Title: cGANs with Projection Discriminator
Comments: Published as a conference paper at ICLR 2018
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
We propose a novel, projection-based way to incorporate the conditional information into the discriminator of GANs that respects the role of the conditional information in the underlying probabilistic model. This approach is in contrast with most frameworks of conditional GANs used in application today, which use the conditional information by concatenating the (embedded) conditional vector to the feature vectors. With this modification, we were able to significantly improve the quality of the class-conditional image generation on the ILSVRC2012 (ImageNet) 1000-class image dataset from the current state-of-the-art result, and we achieved this with a single pair of a discriminator and a generator. We were also able to extend the application to super-resolution and succeeded in producing highly discriminative super-resolution images. This new structure also enabled high quality category transformation based on parametric functional transformation of conditional batch normalization layers in the generator.
 [42] arXiv:1802.05666 (cross-list from cs.LG) [pdf, ps, other]

Title: Adversarial Risk and the Dangers of Evaluating Against Weak Attacks
Subjects: Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. The existence of adversarial examples in trained neural networks reflects the fact that expected risk alone does not capture the model's performance against worst-case inputs. We motivate the use of adversarial risk as an objective, although it cannot easily be computed exactly. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may be obscured to adversaries by optimizing this surrogate rather than the true adversarial risk. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that our formulations and results will help researchers to develop more powerful defenses.
 [43] arXiv:1802.05693 (cross-list from cs.LG) [pdf, ps, other]

Title: Bandit Learning with Positive Externalities
Comments: 27 pages, 1 table
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
Many platforms are characterized by the fact that future user arrivals are likely to have preferences similar to users who were satisfied in the past. In other words, arrivals exhibit positive externalities. We study multi-armed bandit (MAB) problems with positive externalities. Our model has a finite number of arms and users are distinguished by the arm(s) they prefer. We model positive externalities by assuming that the preferred arms of future arrivals are self-reinforcing based on the experiences of past users. We show that classical algorithms such as UCB which are optimal in the classical MAB setting may even exhibit linear regret in the context of positive externalities. We provide an algorithm which achieves optimal regret and show that such optimal regret exhibits substantially different structure from that observed in the standard MAB setting.
 [44] arXiv:1802.05694 (cross-list from cs.CL) [pdf, other]

Title: Multinomial Adversarial Networks for Multi-Domain Text Classification
Comments: NAACL 2018
Subjects: Computation and Language (cs.CL); Learning (cs.LG); Machine Learning (stat.ML)
Many text classification tasks are known to be highly domain-dependent. Unfortunately, the availability of training data can vary drastically across domains. Worse still, for some domains there may not be any annotated data at all. In this work, we propose a multinomial adversarial network (MAN) to tackle the text classification problem in this real-world multi-domain setting (MDTC). We provide theoretical justifications for the MAN framework, proving that different instances of MANs are essentially minimizers of various f-divergence metrics (Ali and Silvey, 1966) among multiple probability distributions. MANs are thus a theoretically sound generalization of traditional adversarial networks that discriminate over two distributions. More specifically, for the MDTC task, MAN learns features that are invariant across multiple domains by resorting to its ability to reduce the divergence among the feature distributions of each domain. We present experimental results showing that MANs significantly outperform the prior art on the MDTC task. We also show that MANs achieve state-of-the-art performance for domains with no labeled data.
 [45] arXiv:1802.05695 (cross-list from cs.CL) [pdf, other]

Title: Explainable Prediction of Medical Codes from Clinical Text
Comments: NAACL 2018
Subjects: Computation and Language (cs.CL); Learning (cs.LG); Machine Learning (stat.ML)
Clinical notes are text documents that are created by clinicians for each patient encounter. They are typically accompanied by medical codes, which describe the diagnosis and treatment. Annotating these codes is labor intensive and error prone; furthermore, the connection between the codes and the text is not annotated, obscuring the reasons and details behind specific diagnoses and treatments. We present an attentional convolutional network that predicts medical codes from clinical text. Our method aggregates information across the document using a convolutional neural network, and uses an attention mechanism to select the most relevant segments for each of the thousands of possible codes. The method is accurate, achieving precision@8 of 0.7 and a Micro-F1 of 0.52, which are both significantly better than the prior state of the art. Furthermore, through an interpretability evaluation by a physician, we show that the attention mechanism identifies meaningful explanations for each code assignment.
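For readers unfamiliar with the precision@8 metric quoted in this abstract, the following sketch shows how precision@k is commonly computed for one document in a multi-label setting. The function name, the label names, and the scores are hypothetical and are not taken from the paper.

```python
def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scoring labels that are truly relevant.

    scores:   dict mapping each candidate label to a predicted score
    relevant: set of labels that actually apply to the document
    k:        number of top-ranked predictions to inspect
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank labels by descending score and keep the first k.
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for label in top_k if label in relevant) / k


# Hypothetical scores for four made-up codes on a single note.
scores = {"code_a": 0.9, "code_b": 0.8, "code_c": 0.1, "code_d": 0.7}
relevant = {"code_a", "code_d"}
print(precision_at_k(scores, relevant, 2))  # top 2 are code_a and code_b
```

A corpus-level figure such as the one reported is typically an average of this per-document value, though the abstract does not state the averaging scheme.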
Replacements for Fri, 16 Feb 18
 [46] arXiv:1207.5895 (replaced) [pdf, ps, other]

Title: Social learning equilibria
Subjects: Statistics Theory (math.ST)
 [47] arXiv:1503.05436 (replaced) [pdf, other]

Title: Inference in Additively Separable Models With a High-Dimensional Set of Conditioning Variables
Authors: Damian Kozbur
Subjects: Statistics Theory (math.ST)
 [48] arXiv:1604.04706 (replaced) [pdf, other]

Title: DS-MLR: Exploiting Double Separability for Scaling up Distributed Multinomial Logistic Regression
Authors: Parameswaran Raman, Sriram Srinivasan, Shin Matsushima, Xinhua Zhang, Hyokun Yun, S.V.N. Vishwanathan
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
 [49] arXiv:1605.09232 (replaced) [pdf, ps, other]

Title: Tradeoffs between Convergence Speed and Reconstruction Accuracy in Inverse Problems
Comments: To appear in IEEE Transactions on Signal Processing
Subjects: Numerical Analysis (cs.NA); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [50] arXiv:1702.05008 (replaced) [pdf, other]

Title: Tree Ensembles with Rule Structured Horseshoe Regularization
Comments: 24 pages. R package
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
 [51] arXiv:1703.03165 (replaced) [pdf, other]

Title: Perturbation Bootstrap in Adaptive Lasso
Comments: 43 pages, 3 tables, 2 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
 [52] arXiv:1704.06176 (replaced) [pdf, other]

Title: Segmentation of the Proximal Femur from MR Images using Deep Convolutional Neural Networks
Authors: Cem M. Deniz, Siyuan Xiang, Spencer Hallyburton, Arakua Welbeck, Stephen Honig, Kyunghyun Cho, Gregory Chang
Comments: 26 pages, 5 figures, and 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG); Machine Learning (stat.ML)
 [53] arXiv:1704.07949 (replaced) [pdf, other]

Title: Reconditioning your quantile function
Authors: Keith Pedersen
Comments: 11 pages, 3 figures, 2 algorithms
Subjects: Computation (stat.CO); Data Analysis, Statistics and Probability (physics.data-an)
 [54] arXiv:1705.06073 (replaced) [pdf, other]

Title: Superfast Line Spectral Estimation
Comments: 16 pages, 7 figures, accepted for IEEE Transactions on Signal Processing
Subjects: Information Theory (cs.IT); Applications (stat.AP)
 [55] arXiv:1705.08415 (replaced) [pdf, other]

Title: Community Detection with Graph Neural Networks
Subjects: Machine Learning (stat.ML)
 [56] arXiv:1706.03471 (replaced) [pdf, other]

Title: YellowFin and the Art of Momentum Tuning
Comments: Updated to reflect improved stability discussion and work for SysML presentation
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI)
 [57] arXiv:1706.06066 (replaced) [pdf, other]

Title: On Quadratic Convergence of DC Proximal Newton Algorithm for Nonconvex Sparse Learning in High Dimensions
Comments: 36 pages, 5 figures, 1 table, Accepted at NIPS 2017
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Optimization and Control (math.OC)
 [58] arXiv:1706.10295 (replaced) [pdf, other]

Title: Noisy Networks for Exploration
Authors: Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg
Comments: ICLR 2018
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
 [59] arXiv:1707.07113 (replaced) [pdf, other]

Title: Adversarial Variational Optimization of Non-Differentiable Simulators
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
 [60] arXiv:1708.00829 (replaced) [pdf, ps, other]

Title: Complexity Results for MCMC derived from Quantitative Bounds
Subjects: Computation (stat.CO); Probability (math.PR)
 [61] arXiv:1709.06360 (replaced) [pdf, ps, other]

Title: Minimax lower bounds for function estimation on graphs
Subjects: Statistics Theory (math.ST)
 [62] arXiv:1709.06853 (replaced) [pdf, other]

Title: Bandits with Delayed, Aggregated Anonymous Feedback
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
 [63] arXiv:1709.10433 (replaced) [pdf, other]

Title: On the Capacity of Face Representation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [64] arXiv:1710.06451 (replaced) [pdf, other]

Title: A Bayesian Perspective on Generalization and Stochastic Gradient Descent
Comments: 13 pages, 9 figures. Published as a conference paper at ICLR 2018
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [65] arXiv:1711.05360 (replaced) [pdf, other]

Title: The Dispersion Bias
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)
 [66] arXiv:1712.01193 (replaced) [pdf, other]

Title: A dual framework for trace norm regularized low-rank tensor completion
Comments: Title changed from earlier version, a shorter version appeared in the NIPS workshop on Synergies in Geometric Data Analysis 2017
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
 [67] arXiv:1802.03653 (replaced) [pdf, ps, other]

Title: On Symplectic Optimization
Comments: 20 pages, 5 figures
Subjects: Computation (stat.CO)
 [68] arXiv:1802.04784 (replaced) [pdf, ps, other]

Title: MONK -- Outlier-Robust Mean Embedding Estimation by Median-of-Means
Comments: 11 pages
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Functional Analysis (math.FA); Statistics Theory (math.ST)
 [69] arXiv:1802.04826 (replaced) [pdf, other]

Title: Leveraging the Exact Likelihood of Deep Latent Variables Models
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Methodology (stat.ME)
 [70] arXiv:1802.04956 (replaced) [pdf, ps, other]

Title: D2KE: From Distance to Kernel and Embedding
Comments: 18 pages, 4 tables
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
 [71] arXiv:1802.05141 (replaced) [pdf, other]

Title: Deep Learning and Data Assimilation for Real-Time Production Prediction in Natural Gas Wells
Comments: Reduced length preprint submitted to IJCAI 2018 for review
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn); Geophysics (physics.geo-ph); Machine Learning (stat.ML)
 [72] arXiv:1802.05155 (replaced) [pdf, other]

Title: Toward Deeper Understanding of Nonconvex Stochastic Optimization with Momentum using Diffusion Approximations
Subjects: Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)