
Title: R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space

Abstract: Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string is an important research topic with a wide variety of applications in areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient: their space usage is proportional to the length of the input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on the RLBWT. R-enum runs in $O(n \log \log (n/r))$ time and with $O(r \log n)$ bits of working space for string length $n$ and number $r$ of runs in the RLBWT, where $r$ is expected to be significantly smaller than $n$ for highly repetitive strings (i.e., strings with many repetitions). Experiments using a benchmark dataset of highly repetitive strings show that r-enum is more space-efficient than previous methods. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes.
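
To make the quantities $n$ and $r$ concrete, the following is a minimal illustrative sketch of the RLBWT itself, not of r-enum. It builds the BWT by naively sorting all rotations, which takes space proportional to $n$ and so is exactly what the paper avoids. The function names `bwt` and `run_length_encode` and the `$` sentinel are assumptions made for this example and do not come from the paper.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations (illustration only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    # Sort every rotation of s; the BWT is the last character of each one.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse maximal runs of equal characters into (character, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


if __name__ == "__main__":
    for text in ("banana", "abababab"):
        runs = run_length_encode(bwt(text))
        # n counts the sentinel; r is the number of runs in the BWT.
        print(text, "n =", len(text) + 1, "r =", len(runs), runs)
```

On the repetitive input `abababab` the BWT collapses to three runs for a string of nine characters (including the sentinel), which is the kind of gap between $r$ and $n$ that the $O(r \log n)$-bit bound relies on.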
Comments: The content of the paper is significantly different from the previous version
Subjects: Data Structures and Algorithms (cs.DS)
Cite as: arXiv:2004.01493 [cs.DS]
  (or arXiv:2004.01493v4 [cs.DS] for this version)

Submission history

From: Takaaki Nishimoto
[v1] Fri, 3 Apr 2020 12:12:01 GMT (340kb,D)
[v2] Fri, 1 May 2020 07:22:16 GMT (461kb,D)
[v3] Fri, 15 May 2020 08:32:48 GMT (455kb,D)
[v4] Tue, 2 Mar 2021 06:31:55 GMT (785kb,D)