Title: Crosslingual Topic Modeling with WikiPDA

Abstract: We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using matrix completion and then training a standard monolingual topic model. A human evaluation shows that WikiPDA produces more coherent topics than monolingual text-based LDA, thus offering crosslinguality at no cost. We demonstrate WikiPDA's utility in two applications: a study of topical biases in 28 Wikipedia editions, and crosslingual supervised classification. Finally, we highlight WikiPDA's capacity for zero-shot language transfer, where a model is reused for new languages without any fine-tuning. Researchers can benefit from WikiPDA as a practical tool for studying Wikipedia's content across its 299 language editions in interpretable ways, via an easy-to-use, publicly available library.
Comments: 10 pages, WWW - The Web Conference, 2021
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL)
DOI: 10.1145/3442381.3449805
Cite as: arXiv:2009.11207 [cs.CL]
  (or arXiv:2009.11207v2 [cs.CL] for this version)

Submission history

From: Tiziano Piccardi
[v1] Wed, 23 Sep 2020 15:19:27 GMT (8111kb,D)
[v2] Sun, 14 Feb 2021 13:28:18 GMT (999kb,D)