Title: TuGeBiC: A Turkish German Bilingual Code-Switching Corpus

Abstract: In this paper we describe the process of collection, transcription, and annotation of recordings of spontaneous speech samples from Turkish-German bilinguals, and the compilation of a corpus called TuGeBiC. Participants in the study were adult Turkish-German bilinguals living in Germany or Turkey at the time of recording in the first half of the 1990s. The data were manually tokenised and normalised, and all proper names (names of participants and places mentioned in the conversations) were replaced with pseudonyms. Token-level automatic language identification was performed, which made it possible to establish the proportions of words from each language. The corpus is roughly balanced between both languages. We also present quantitative information about the number of code-switches, and give examples of different types of code-switching found in the data. The resulting corpus has been made freely available to the research community.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2205.00868 [cs.CL]
  (or arXiv:2205.00868v1 [cs.CL] for this version)

Submission history

From: Jeanine Treffers-Daller Professor
[v1] Mon, 2 May 2022 12:53:05 GMT (37kb)
