Title: Mapping Languages and Demographics with Georeferenced Corpora

Abstract: This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts of the world the datasets are most representative of actual populations; and (iii) how to weight the datasets to provide more accurate representations of underlying populations. The paper finds that the two datasets represent very different populations and that they correlate with actual populations with values of r=0.60 (social media) and r=0.49 (web-crawled). Further, Twitter data makes better predictions about the inventory of languages used in each country.
Comments: Proceedings of GeoComputation 19
Subjects: Computation and Language (cs.CL)
DOI: 10.17608/k6.auckland.9869252.v2
Cite as: arXiv:2004.00809 [cs.CL]
  (or arXiv:2004.00809v1 [cs.CL] for this version)
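
The r values quoted in the abstract are correlation coefficients between corpus-derived and census-derived figures. As a minimal sketch of how such a value could be computed, assuming a Pearson correlation over per-country shares (the abstract does not state the exact procedure), the following uses a hypothetical helper `pearson_r` and made-up illustrative numbers that are not taken from the paper:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-country shares, for illustration only (not the paper's data).
population_share = [0.40, 0.30, 0.20, 0.10]  # share of total population
corpus_share = [0.25, 0.35, 0.30, 0.10]      # share of corpus text

r = pearson_r(population_share, corpus_share)
print(f"r = {r:.2f}")
```

A value near 1 would mean that the corpus distributes its text across countries in roughly the same proportions as the population; lower values indicate that some countries are over- or under-represented.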

Submission history

From: Jonathan Dunn
[v1] Thu, 2 Apr 2020 04:34:11 GMT (2123kb,D)
