LightLDA: Big Topic Models on Modest Compute Clusters

Yuan, Jinhui; Gao, Fei; Ho, Qirong; Dai, Wei; Wei, Jinliang; Zheng, Xun; Xing, Eric P.; Liu, Tie-Yan; Ma, Wei-Ying

(Submitted on 4 Dec 2014)

Abstract: When building large-scale machine learning (ML) programs, such as big topic models or deep neural nets, one usually assumes such tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners or academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens -- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude faster than current state-of-the-art Gibbs samplers; 2) a structure-aware model-parallel scheme, which leverages dependencies within the topic model, yielding a sampling strategy that is frugal on machine memory and network communication; 3) a differential data-structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed; and 4) a bounded asynchronous data-parallel scheme, which allows efficient distributed processing of massive data via a parameter server. Our distribution strategy is an instance of the model-and-data-parallel programming model underlying the Petuum framework for general distributed ML, and was implemented on top of the Petuum open-source system. We provide experimental evidence showing how this development puts massive models within reach on a small cluster while still enjoying proportional time cost reductions with increasing cluster size, in comparison with alternative options.
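The abstract states only that the Metropolis-Hastings sampler has O(1) cost per sample, independent of the number of topics; it does not describe the mechanism. As a hedged illustration of how such a cost profile can arise in general, the sketch below pairs an alias table (O(K) to build, O(1) per draw) with an independence Metropolis-Hastings step that evaluates the target at only two points. The alias-table choice, the function names (`build_alias`, `alias_draw`, `mh_step`), and the plain weight lists standing in for topic distributions are all assumptions made for illustration, not the paper's implementation.

```python
import random


def build_alias(weights):
    """Build a Walker/Vose alias table for unnormalized positive weights.

    Construction is O(K); each later draw is O(1).
    """
    k = len(weights)
    total = sum(weights)
    scaled = [w * k / total for w in weights]
    prob = [1.0] * k            # probability of keeping the drawn bucket
    alias = list(range(k))      # fallback outcome for each bucket
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # Bucket l donates the mass needed to fill bucket s.
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias


def alias_draw(prob, alias, rng):
    """Draw one index in O(1): pick a bucket, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


def mh_step(current, target, proposal, prob, alias, rng):
    """One independence Metropolis-Hastings step over K discrete states.

    `target` and `proposal` are unnormalized weight lists (illustrative
    stand-ins); only two entries of each are read, so the step is O(1) in K.
    """
    cand = alias_draw(prob, alias, rng)
    ratio = (target[cand] * proposal[current]) / (target[current] * proposal[cand])
    return cand if rng.random() < min(1.0, ratio) else current


if __name__ == "__main__":
    rng = random.Random(0)
    target = [4.0, 3.0, 2.0, 1.0]     # hypothetical target weights
    proposal = [1.0, 1.0, 1.0, 1.0]   # hypothetical (cheap) proposal weights
    prob, alias = build_alias(proposal)
    state, counts = 0, [0] * len(target)
    for _ in range(100_000):
        state = mh_step(state, target, proposal, prob, alias, rng)
        counts[state] += 1
    print([c / sum(counts) for c in counts])
```

The design point is that the proposal table is built once and reused across many draws, so the per-sample work does not grow with the number of states; the accept/reject step corrects for the mismatch between proposal and target.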

Subjects:	Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:1412.1576 [stat.ML]
	(or arXiv:1412.1576v1 [stat.ML] for this version)

Submission history

From: Xun Zheng
[v1] Thu, 4 Dec 2014 07:49:12 GMT (600kb,D)
