# Computer Science > Data Structures and Algorithms

# Title: Optimal Construction of Compressed Indexes for Highly Repetitive Texts

(Submitted on 13 Dec 2017 (v1), last revised 20 May 2019 (this version, v8))

Abstract: We propose algorithms that, given an input string of length $n$ over an integer alphabet of size $\sigma$, construct the Burrows-Wheeler transform (BWT), the permuted longest-common-prefix (PLCP) array, and the LZ77 parsing in $O(n/\log_{\sigma}n+r\,{\rm polylog}\,n)$ time and working space, where $r$ is the number of runs in the BWT of the input. These are the essential components of many compressed indexes, such as the compressed suffix tree, the FM-index, and grammar- and LZ77-based indexes, and they also find numerous applications in sequence analysis and data compression. The value of $r$ is a common measure of repetitiveness that is significantly smaller than $n$ if the string is highly repetitive. Since just accessing every symbol of the string requires $\Omega(n/\log_{\sigma}n)$ time, the presented algorithms are time and space optimal for inputs satisfying the assumption $n/r\in\Omega({\rm polylog}\,n)$ on the repetitiveness. For such inputs, our result improves upon the currently fastest general algorithms of Belazzougui (STOC 2014) and Munro et al. (SODA 2017), which run in $O(n)$ time and use $O(n/\log_{\sigma} n)$ working space. We also show how to use our techniques to obtain optimal solutions on highly repetitive data for other fundamental string processing problems, such as Lyndon factorization, construction of run-length compressed suffix arrays, and some classical "textbook" problems, such as computing the longest substring occurring at least some fixed number of times.
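To make the repetitiveness measure $r$ concrete, the following minimal Python sketch computes the BWT of a short string and counts its runs. It is a naive illustration that sorts all rotations, not the construction proposed in the paper. The function names `bwt` and `count_runs` and the `$` sentinel are assumptions introduced here, and the sentinel is assumed to be lexicographically smaller than every symbol of the text.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Illustration only; this is far slower than the algorithms in the abstract.
    Assumes the sentinel is smaller than every character in `text`.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal blocks of equal consecutive symbols (the value r)."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    for text in ("banana", "ab" * 8):
        transformed = bwt(text)
        print(text, transformed, count_runs(transformed))
```

Under these assumptions, a highly repetitive input such as `"ab" * 8` yields a BWT with only a few runs relative to its length, which is the regime $n/r\in\Omega({\rm polylog}\,n)$ that the abstract targets.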

## Submission history

From: Dominik Kempa

**[v1]** Wed, 13 Dec 2017 17:56:24 GMT (16kb)

**[v2]** Fri, 22 Dec 2017 13:05:52 GMT (19kb)

**[v3]** Sat, 27 Jan 2018 21:30:19 GMT (21kb)

**[v4]** Sat, 17 Mar 2018 23:40:12 GMT (21kb)

**[v5]** Mon, 9 Apr 2018 15:51:48 GMT (29kb)

**[v6]** Sat, 21 Apr 2018 19:20:09 GMT (30kb)

**[v7]** Fri, 17 May 2019 15:38:16 GMT (28kb)

**[v8]** Mon, 20 May 2019 01:00:47 GMT (28kb)
