We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

cs.CL

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo

Computer Science > Information Retrieval

Title: Topic Modeling on Podcast Short-Text Metadata

Abstract: Podcasts have emerged as a massively consumed online content, notably due to wider accessibility of production means and scaled distribution through large streaming platforms. Categorization systems and information access technologies typically use topics as the primary way to organize or navigate podcast collections. However, annotating podcasts with topics is still quite problematic because the assigned editorial genres are broad, heterogeneous or misleading, or because of data challenges (e.g. short metadata text, noisy transcripts). Here, we assess the feasibility to discover relevant topics from podcast metadata, titles and descriptions, using topic modeling techniques for short text. We also propose a new strategy to leverage named entities (NEs), often present in podcast metadata, in a Non-negative Matrix Factorization (NMF) topic modeling framework. Our experiments on two existing datasets from Spotify and iTunes and Deezer, a new dataset from an online service providing a catalog of podcasts, show that our proposed document representation, NEiCE, leads to improved topic coherence over the baselines. We release the code for experimental reproducibility of the results.
Comments: Accepted for publication in the 44nd European Conference on Information Retrieval (ECIR'22)
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Cite as: arXiv:2201.04419 [cs.IR]
  (or arXiv:2201.04419v1 [cs.IR] for this version)

Submission history

From: Elena V. Epure [view email]
[v1] Wed, 12 Jan 2022 11:07:05 GMT (42kb)

Link back to: arXiv, form interface, contact.