Title: Spectral Analysis of Word Statistics

Abstract: Given a random text over a finite alphabet, we study the frequencies at which fixed-length words occur as subsequences. As the data size grows, the joint distribution of word counts exhibits a rich asymptotic structure. We investigate all linear combinations of subword statistics, and fully characterize their different orders of magnitude using diverse algebraic tools.
Moreover, we establish the spectral decomposition of the space of word statistics of each order. We provide explicit formulas for the eigenvectors and eigenvalues of the covariance matrix of the multivariate distribution of these statistics. Our techniques include and elaborate on a set of algebraic word operators, recently studied and employed by Dieker and Saliola (Adv Math, 2018).
Subword counts find applications in Combinatorics, Statistics, and Computer Science. We revisit special cases from the combinatorial literature, such as intransitive dice, random core partitions, and questions on random walks. Our structural approach describes several classical statistical tests in a unified framework. We propose further potential applications to data analysis and machine learning.
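
To make the basic object concrete, the following is a minimal sketch of a subword count, assuming that "occurring as a subsequence" means the letters of the word appear in order but not necessarily contiguously. The function names `count_subword` and `subword_counts` are illustrative and are not taken from the paper.

```python
from itertools import product
from math import comb


def count_subword(text, word):
    """Count the occurrences of `word` in `text` as a subsequence
    (letters in order, not necessarily adjacent)."""
    # ways[j] = number of ways to match the first j letters of `word`
    # inside the prefix of `text` read so far.
    ways = [1] + [0] * len(word)
    for ch in text:
        # Iterate j downwards so that one text letter is never used
        # for two positions of the same occurrence.
        for j in range(len(word), 0, -1):
            if word[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[len(word)]


def subword_counts(text, alphabet, k):
    """Return the counts of every length-k word over `alphabet` (hypothetical helper)."""
    words = ("".join(w) for w in product(alphabet, repeat=k))
    return {w: count_subword(text, w) for w in words}


counts = subword_counts("abab", "ab", 2)
# Every choice of 2 positions out of 4 spells exactly one word,
# so the counts sum to comb(4, 2).
assert sum(counts.values()) == comb(4, 2)
```

For a text of length n, the counts of all length-k words always sum to the binomial coefficient C(n, k), which is one simple linear relation among these statistics.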
Subjects: Probability (math.PR); Combinatorics (math.CO); Statistics Theory (math.ST)
Cite as: arXiv:2012.00742 [math.PR]
  (or arXiv:2012.00742v2 [math.PR] for this version)

Submission history

From: Chaim Even-Zohar
[v1] Tue, 1 Dec 2020 18:59:40 GMT (67kb)
[v2] Thu, 3 Dec 2020 16:29:12 GMT (67kb)
