Universal and non-universal text statistics: Clustering coefficient for language identification

Espitia, Diego; Larralde, Hernán

doi:10.1016/j.physa.2019.123905

(Submitted on 18 Nov 2019 (v1), last revised 7 Dec 2019 (this version, v2))

Abstract: In this work we analyze statistical properties of 91 relatively small texts in 7 different languages (Spanish, English, French, German, Turkish, Russian, Icelandic), as well as texts with randomly inserted spaces. Despite the small size of the texts (around 11260 different words), the well-known universal statistical laws -- namely Zipf's and Herdan-Heaps' laws -- are confirmed, and are in close agreement with results obtained elsewhere. We also construct a word co-occurrence network for each text. While the degree distribution is again universal, we note that the distribution of clustering coefficients, which depend strongly on the local structure of the networks, can be used to differentiate between languages, as well as to distinguish natural languages from random texts.
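
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code, and it rests on assumptions the abstract does not confirm: words are taken to be runs of letters, two words are linked in the co-occurrence network when they appear next to each other, and the clustering coefficient is the standard local (Watts-Strogatz) one. All function names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters (assumed word definition)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def rank_frequency(tokens):
    """Word counts sorted from most to least frequent (the data behind Zipf's law)."""
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token (the data behind Herdan-Heaps' law)."""
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def cooccurrence_network(tokens):
    """Undirected graph linking words that are adjacent in the text (assumed construction)."""
    neighbors = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # ignore self-loops from repeated words
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def clustering_coefficients(neighbors):
    """Local clustering coefficient per word; words with fewer than 2 neighbors get 0.0."""
    coeffs = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0
            continue
        nbr_list = list(nbrs)
        # Count links between pairs of neighbors of this node.
        links = sum(
            1
            for i, u in enumerate(nbr_list)
            for v in nbr_list[i + 1:]
            if v in neighbors[u]
        )
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs


# Toy usage on a single sentence; real texts would be read from files.
tokens = tokenize("the cat saw the dog")
coeffs = clustering_coefficients(cooccurrence_network(tokens))
```

Comparing the distribution of the values in `coeffs` across texts is the kind of analysis the abstract describes; how the authors bin or compare those distributions is not stated here.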

Comments:	15 pages, 6 figures
Subjects:	Physics and Society (physics.soc-ph); Computation and Language (cs.CL)
DOI:	10.1016/j.physa.2019.123905
Cite as:	arXiv:1911.08915 [physics.soc-ph]
	(or arXiv:1911.08915v2 [physics.soc-ph] for this version)

Submission history

From: Diego Espitia
[v1] Mon, 18 Nov 2019 21:39:19 GMT (350kb)
[v2] Sat, 7 Dec 2019 01:26:11 GMT (1961kb)
