Title: GapPredict: A Language Model for Resolving Gaps in Draft Genome Assemblies

Abstract: Short-read DNA sequencing instruments can yield over 1e+12 bases per run, typically composed of reads 150 bases long. Despite this high throughput, de novo assembly algorithms have difficulty reconstructing contiguous genome sequences from short reads because genomes contain both repetitive and difficult-to-sequence regions. Some of these short-read assembly challenges are mitigated by scaffolding assembled sequences using paired-end reads. However, unresolved sequences in these scaffolds appear as "gaps". Here, we introduce GapPredict, a tool that uses a character-level language model to predict unresolved nucleotides in scaffold gaps. We benchmarked GapPredict against the state-of-the-art gap-filling tool Sealer and observed that GapPredict can fill 65.6% of the sampled gaps that Sealer left unfilled, demonstrating the practical utility of deep learning approaches to the gap-filling problem in genome sequence assembly.
Comments: 9 pages, 7 figures. IEEE/ACM Trans Comput Biol Bioinform (2021)
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
DOI: 10.1109/TCBB.2021.3109557
Cite as: arXiv:2105.10552 [q-bio.GN]
  (or arXiv:2105.10552v2 [q-bio.GN] for this version)

Submission history

From: Rene Warren
[v1] Fri, 21 May 2021 19:54:41 GMT (1044kb)
[v2] Tue, 25 May 2021 00:55:42 GMT (1044kb)
