Computer Science > Data Structures and Algorithms
Title: R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space
(Submitted on 3 Apr 2020 (v1), last revised 2 Mar 2021 (this version, v4))
Abstract: Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string is an important research topic with a wide variety of applications in areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient: their space usage is proportional to the length of the input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on the RLBWT. R-enum runs in $O(n \log \log (n/r))$ time and uses $O(r \log n)$ bits of working space, where $n$ is the length of the string and $r$ is the number of runs in its RLBWT; $r$ is expected to be significantly smaller than $n$ for highly repetitive strings (i.e., strings with many repetitions). Experiments on a benchmark dataset of highly repetitive strings show that r-enum is more space-efficient than previous methods. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes.
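To make the quantities $n$ and $r$ concrete, the following is a minimal Python sketch, not r-enum itself, that builds the BWT of a small string by sorting rotations and then run-length encodes it. The function names `bwt` and `run_length_encode`, the `$` sentinel, and the example strings are assumptions introduced here for illustration; the quadratic rotation sort is only suitable for toy inputs.

```python
from itertools import groupby
from typing import List, Tuple


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted rotations (illustration only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    s = text + sentinel  # '$' sorts before letters and digits in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s: str) -> List[Tuple[str, int]]:
    """Collapse each maximal run of equal characters into (char, length)."""
    return [(ch, sum(1 for _ in grp)) for ch, grp in groupby(s)]


if __name__ == "__main__":
    for text in ("banana", "ab" * 8):
        transformed = bwt(text)
        runs = run_length_encode(transformed)
        # n counts the sentinel; r is the number of runs in the BWT.
        print(text, transformed, "n =", len(transformed), "r =", len(runs))
```

For the repetitive input `"ab" * 8` the transform has far fewer runs than characters, which is the regime in which space proportional to $r$ rather than $n$ pays off.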
Submission history
From: Takaaki Nishimoto
[v1] Fri, 3 Apr 2020 12:12:01 GMT (340kb,D)
[v2] Fri, 1 May 2020 07:22:16 GMT (461kb,D)
[v3] Fri, 15 May 2020 08:32:48 GMT (455kb,D)
[v4] Tue, 2 Mar 2021 06:31:55 GMT (785kb,D)