Title: Parallel Evaluation of Multi-Semi-Joins

Abstract: While services such as Amazon AWS make computing power abundantly available, adding more computing nodes can incur high costs in, for instance, pay-as-you-go plans, while not always significantly improving the net running time (also known as wall-clock time) of queries. In this work, we provide algorithms for parallel evaluation of SGF queries in MapReduce that optimize total time while retaining low net time. SGF queries can express not only all semi-join reducers but also more expressive queries involving disjunction and negation. Since SGF queries can be seen as Boolean combinations of (potentially nested) semi-joins, we introduce a novel multi-semi-join (MSJ) MapReduce operator that enables the evaluation of a set of semi-joins in one job. We use this operator to obtain parallel query plans for SGF queries that outperform sequential plans w.r.t. net time, and we provide additional optimizations aimed at minimizing total time without severely affecting net time. Even though the latter optimization problems are NP-hard, we present effective greedy algorithms. Our experiments, conducted using our own implementation, Gumbo, on top of Hadoop, confirm the usefulness of parallel query plans and the effectiveness and scalability of our optimizations, all with a significant improvement over Pig and Hive.
Comments: added Gumbo code reference, added Subset Sum reference, adjusted alignment in Figure 1, adjusted Figure 5 (remove redundant units, larger font), removed capitals in Table 2, boxes for environment ends, clarified proof in appendix, reference cleanup (pages, capitalization), uncapitalized "REQUEST" and "ASSERT" when used in text, small rewordings (no results affected)
Subjects: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:1605.05219 [cs.DB]
  (or arXiv:1605.05219v2 [cs.DB] for this version)
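
The idea in the abstract of evaluating a set of semi-joins in one job can be illustrated with a small in-memory sketch. This is not the Gumbo implementation and does not use Hadoop: the function name multi_semi_join, the input layout, and the "request"/"assert" message labels are assumptions made for illustration only. Each guard tuple sends a request keyed on its join value, each conditional tuple sends an assert for the same key, and a single simulated shuffle and reduce pass answers all semi-joins together.

```python
from collections import defaultdict


def multi_semi_join(relations, semi_joins):
    """Evaluate several semi-joins in one simulated map/shuffle/reduce pass.

    relations:  dict mapping a relation name to a list of tuples
    semi_joins: list of (guard, guard_cols, cond, cond_cols); semi-join i
                keeps the guard tuples whose guard_cols values also occur
                as cond_cols values in the cond relation
    returns:    a list with one result list per semi-join
    """
    # Map + shuffle: group all messages by join-key value.
    shuffle = defaultdict(lambda: {"request": [], "assert": set()})
    for i, (guard, gcols, cond, ccols) in enumerate(semi_joins):
        for t in relations[guard]:
            key = tuple(t[c] for c in gcols)
            shuffle[key]["request"].append((i, t))  # guard tuple asks for a match
        for t in relations[cond]:
            key = tuple(t[c] for c in ccols)
            shuffle[key]["assert"].add(i)  # conditional tuple confirms the key

    # Reduce: a request for semi-join i succeeds if the same key
    # received an assert for semi-join i.
    results = [[] for _ in semi_joins]
    for msgs in shuffle.values():
        for i, t in msgs["request"]:
            if i in msgs["assert"]:
                results[i].append(t)
    return results


# Hypothetical toy data: R(x, y), S(x), T(y).
relations = {
    "R": [(1, "a"), (2, "b"), (3, "c")],
    "S": [(1,), (3,)],
    "T": [("b",), ("c",)],
}
# Two semi-joins answered in one pass: R semi-join S on x, R semi-join T on y.
semi_joins = [("R", (0,), "S", (0,)), ("R", (1,), "T", (0,))]
r_s, r_t = multi_semi_join(relations, semi_joins)
```

Under these assumptions, a Boolean combination of the semi-joins reduces to set operations on the per-semi-join results, for example set(r_s) & set(r_t) for a conjunction or set(r_s) | set(r_t) for a disjunction.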

Submission history

From: Jonny Daenen
[v1] Tue, 17 May 2016 15:53:27 GMT (894kb,D)
[v2] Sun, 22 May 2016 11:22:44 GMT (895kb,D)
