Title: Cem Mil Podcasts: A Spoken Portuguese Document Corpus For Multi-modal, Multi-lingual and Multi-Dialect Information Access Research

Abstract: In this paper, we describe the Portuguese-language podcast dataset we have released for academic research purposes. We give an overview of how the data was sampled, descriptive statistics over the collection, and information about the distribution over Brazilian and Portuguese dialects. We report results from experiments on multi-lingual summarization, showing that podcast transcripts can be summarized well by a system supporting both English and Portuguese. We also present experiments on Portuguese podcast genre classification using text metadata. Combining this collection with the previously released English-language collection opens up the potential for multi-modal, multi-lingual and multi-dialect podcast information access research.
Comments: 12 pages, 1 figure
Subjects: Computation and Language (cs.CL)
Journal reference: Volume 14163 of Lecture Notes in Computer Science, pages 48-59, Springer, 2023
DOI: 10.1007/978-3-031-42448-9_5
Cite as: arXiv:2209.11871 [cs.CL]
  (or arXiv:2209.11871v2 [cs.CL] for this version)

Submission history

From: Edgar Tanaka Mr
[v1] Fri, 23 Sep 2022 21:41:10 GMT (803kb)
[v2] Wed, 13 Dec 2023 14:39:29 GMT (828kb)
