Title: Web-scale Topic Models in Spark: An Asynchronous Parameter Server

Abstract: In this paper, we train a Latent Dirichlet Allocation (LDA) topic model on the ClueWeb12 data set, a 27-terabyte Web crawl. We extend Spark, a popular framework for large-scale data analysis, with an asynchronous parameter server. Such a parameter server provides a distributed, concurrently accessed parameter space for the model. We implement a Metropolis-Hastings-based collapsed Gibbs sampler on top of this parameter server, achieving amortized O(1) sampling complexity. We compare our implementation to the default Spark implementations and show that it is significantly faster and more scalable without sacrificing model quality. We train a topic model with 1,000 topics on the full ClueWeb12 data set, uncovering some of the prevalent themes that appear on the Web.
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as: arXiv:1605.07422 [cs.DC]
  (or arXiv:1605.07422v1 [cs.DC] for this version)
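
The abstract's claim of amortized O(1) sampling can be illustrated with an alias table, a standard construction that Metropolis-Hastings-based LDA samplers commonly use as a cheap proposal distribution. The sketch below is an assumption for illustration only: the abstract does not say which construction the paper uses, and the names `build_alias_table` and `sample_alias` are hypothetical. Building the table costs O(K) for K topics, but each later draw costs O(1), so the build cost is amortized when the table is reused for many draws.

```python
import random


def build_alias_table(weights):
    """Build Walker/Vose alias tables in O(K) from unnormalized weights."""
    k = len(weights)
    total = float(sum(weights))
    # Scale so that the average entry equals 1.0.
    scaled = [w * k / total for w in weights]
    prob = [0.0] * k
    alias = [0] * k
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # bucket s keeps its own mass ...
        alias[s] = l          # ... and borrows the remainder from l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftover buckets are full (up to rounding error).
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def sample_alias(prob, alias, rng=random):
    """Draw one index in O(1): pick a bucket, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


# Hypothetical unnormalized topic weights for a single word.
prob, alias = build_alias_table([1, 2, 3, 4])
topic = sample_alias(prob, alias, random.Random(0))
```

In a Metropolis-Hastings sampler, a table like this may be allowed to go stale between rebuilds, because the accept/reject step corrects for the mismatch between the proposal and the current target distribution.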

Submission history

From: Rolf Jagerman
[v1] Tue, 24 May 2016 12:40:29 GMT (1467kb,D)
[v2] Fri, 17 Jun 2016 08:43:56 GMT (1467kb,D)
[v3] Sun, 18 Jun 2017 22:37:23 GMT (3235kb,D)