Title: Recursive Programs for Document Spanners

Abstract: A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (a string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well-studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are obtained by adding capture variables to regular expressions. Equivalently, the regular spanners are the ones expressible in non-recursive Datalog over regex formulas (which extract, from the input document, relations that play the role of EDBs). In this paper, we investigate the expressive power of recursive Datalog over regex formulas. Our main result is that such programs capture precisely the document spanners computable in polynomial time. Additional results compare recursive programs to known formalisms such as the language of core spanners (which extends regular spanners with tests for string equality) and its closure under difference. Finally, we extend our main result to a recently proposed framework that generalizes both the relational model and document spanners.
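
As an informal illustration of the notions in the abstract, the following Python sketch treats a spanner as a function from a document to a set of spans, and evaluates one recursive rule by naive fixpoint iteration. It is not taken from the paper: the names `unary_spanner` and `concat_closure` are hypothetical, spans are assumed to be half-open intervals `(start, end)`, and Python's `re` syntax stands in for regex formulas. The recursive rule used here stays within the regular spanners, so the sketch shows only the evaluation mechanism, not the added expressive power.

```python
import re
from typing import Set, Tuple

# Assumption: a span is a half-open interval [start, end) of positions.
Span = Tuple[int, int]


def unary_spanner(pattern: str, doc: str) -> Set[Span]:
    """Return every span of doc whose substring fully matches pattern.

    This mimics a one-variable regex formula of the shape
    Sigma* x{pattern} Sigma*, where x captures the matched span.
    Brute force over all O(n^2) spans; fine for a small illustration.
    """
    compiled = re.compile(pattern)
    n = len(doc)
    return {
        (i, j)
        for i in range(n + 1)
        for j in range(i, n + 1)
        if compiled.fullmatch(doc[i:j])
    }


def concat_closure(base: Set[Span]) -> Set[Span]:
    """Naive fixpoint evaluation of a recursive Datalog-style program:

        Rep(x, z) :- Base(x, z).
        Rep(x, z) :- Rep(x, y), Base(y, z).

    Spans are joined when the first ends where the second starts.
    """
    result = set(base)
    while True:
        derived = {
            (i, k)
            for (i, j) in result
            for (j2, k) in base
            if j == j2
        }
        new_facts = derived - result
        if not new_facts:  # fixpoint reached: no new spans derivable
            return result
        result |= new_facts


if __name__ == "__main__":
    doc = "ababx"
    tokens = unary_spanner("ab", doc)      # the EDB-like relation
    print(sorted(tokens))
    print(sorted(concat_closure(tokens)))  # spans covering runs of "ab"
```

Because a document of length n has only O(n^2) spans, a relation of fixed arity over spans has polynomially many possible tuples, so a fixpoint loop of this kind terminates after polynomially many rounds.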
Subjects: Databases (cs.DB)
Cite as: arXiv:1712.08198 [cs.DB]
  (or arXiv:1712.08198v3 [cs.DB] for this version)

Submission history

From: Liat Peterfreund
[v1] Thu, 21 Dec 2017 20:22:47 GMT (53kb)
[v2] Tue, 24 Apr 2018 07:38:23 GMT (91kb,D)
[v3] Wed, 23 May 2018 05:13:34 GMT (91kb,D)
