Title: Efficient construction of the extended BWT from grammar-compressed DNA sequencing reads

Abstract: We present an algorithm for building the extended BWT (eBWT) of a string collection from its grammar-compressed representation. Our technique exploits the string repetitions captured by the grammar to speed up the computation of the eBWT. Thus, the more repetitive the collection is, the fewer resources we use per input symbol. We rely on a new grammar recently proposed at DCC'21 whose nonterminals serve as building blocks for inducing the eBWT. A relevant application of this idea is the construction of self-indexes for analyzing sequencing reads, which are massive and repetitive string collections of raw genomic data. Self-indexes have become increasingly popular in Bioinformatics because they can encode more information in less space. Our efficient eBWT construction opens the door to accurate bioinformatic analyses on larger sequence datasets that are not tractable with current eBWT construction techniques.
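To make the target object concrete, the following is a minimal sketch of the standard eBWT definition: collect every cyclic rotation of every string, sort the rotations by comparing their infinite repetitions (the omega-order), and output the last symbol of each. This is a naive construction for illustration only, not the grammar-based algorithm the abstract describes; the names `omega_cmp` and `naive_ebwt` are hypothetical.

```python
from functools import cmp_to_key


def omega_cmp(u, v):
    """Compare u and v by their infinite repetitions u^ω and v^ω.

    By the Fine and Wilf theorem, if the two periodic words agree on
    their first len(u) + len(v) symbols, they agree everywhere.
    """
    for i in range(len(u) + len(v)):
        a, b = u[i % len(u)], v[i % len(v)]
        if a != b:
            return -1 if a < b else 1
    return 0


def naive_ebwt(strings):
    """Naive eBWT: last symbols of all rotations in omega-order.

    Materializes every rotation, so it uses quadratic space per string
    and is only suitable for small examples.
    """
    rotations = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    rotations.sort(key=cmp_to_key(omega_cmp))
    return "".join(r[-1] for r in rotations)


if __name__ == "__main__":
    print(naive_ebwt(["banana"]))
    print(naive_ebwt(["ab", "a"]))
```

For a single string, the omega-order on its rotations coincides with plain lexicographic order, so the result matches the ordinary BWT of the string taken without a terminator symbol.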
Subjects: Genomics (q-bio.GN); Data Structures and Algorithms (cs.DS)
Cite as: arXiv:2102.03961 [q-bio.GN]
  (or arXiv:2102.03961v1 [q-bio.GN] for this version)

Submission history

From: Diego Díaz-Domínguez
[v1] Mon, 8 Feb 2021 02:10:34 GMT (949kb,D)