
Title: A statistical test for correspondence of texts to the Zipf-Mandelbrot law

Abstract: We analyse the correspondence of a text to a simple probabilistic model. The model assumes that the words are selected independently from an infinite dictionary, with a probability distribution that follows the Zipf-Mandelbrot law. We sequentially count the number of different words in the text, which gives the process of the numbers of different words. We then estimate the parameters of the Zipf-Mandelbrot law from the same sequence and construct an estimate of the expected number of different words in the text. Next, we subtract the corresponding values of this estimate from the sequence and normalize along both coordinate axes, obtaining a random process on the segment from 0 to 1. We prove that this process (the empirical text bridge) converges weakly in the uniform metric on $C(0,1)$ to a centered Gaussian process with almost surely continuous paths. We develop and implement an algorithm for the approximate calculation of the eigenvalues of the covariance function of the limit Gaussian process, and then an algorithm for calculating the probability distribution of the integral of the square of this process. We use the algorithm to analyze the uniformity of texts in English, French, Russian and Chinese.
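
As an illustration of the first step only, the sequential count of different words can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `distinct_word_counts` and the whitespace tokenization are assumptions made for the example, and the parameter estimation and normalization steps are not shown because the abstract does not specify them.

```python
from typing import Iterable, List


def distinct_word_counts(words: Iterable[str]) -> List[int]:
    """Return R_1, ..., R_n, where R_k is the number of different
    words among the first k words of the text.

    Hypothetical helper for illustration; not taken from the paper.
    """
    seen = set()
    counts = []
    for word in words:
        seen.add(word)            # no effect if the word was already seen
        counts.append(len(seen))  # number of different words so far
    return counts


if __name__ == "__main__":
    # Assumption: words are separated by whitespace.
    sample = "the cat saw the dog and the dog saw the cat".split()
    print(distinct_word_counts(sample))
```

The resulting sequence is non-decreasing and increases by one exactly at the first occurrence of each word.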
Subjects: Statistics Theory (math.ST); Computation and Language (cs.CL)
Cite as: arXiv:1912.11600 [math.ST]
  (or arXiv:1912.11600v1 [math.ST] for this version)

Submission history

From: Artyom Kovalevskii
[v1] Wed, 25 Dec 2019 05:59:29 GMT (825kb,D)
