Zero-Cost, Arrow-Enabled Data Interface for Apache Spark

Rodriguez, Sebastiaan Alvarez; Chakraborty, Jayjeet; Chu, Aaron; Jimenez, Ivo; LeFevre, Jeff; Maltzahn, Carlos; Uta, Alexandru

Title: Zero-Cost, Arrow-Enabled Data Interface for Apache Spark

Authors: Sebastiaan Alvarez Rodriguez (1), Jayjeet Chakraborty (3), Aaron Chu (2), Ivo Jimenez (2), Jeff LeFevre (2), Carlos Maltzahn (2), Alexandru Uta (1) ((1) Leiden University, (2) UC Santa Cruz)

(Submitted on 24 Jun 2021 (this version), latest version 27 Nov 2021 (v2))

Abstract: Distributed data processing ecosystems are widespread, and their components are highly specialized, making efficient interoperability an urgent need. Recently, the community has adopted Apache Arrow as a format mediator that provides an efficient in-memory data representation. Arrow enables efficient data movement between data processing and storage engines, significantly improving interoperability and overall performance. In this work, we design a new zero-cost data interoperability layer between Apache Spark and Arrow-based data sources through the Arrow Dataset API. Our novel data interface helps separate the computation (Spark) and data (Arrow) layers, enabling practitioners to seamlessly use Spark to access data from all Arrow Dataset API-enabled data sources and frameworks. To benefit our community, we open-source our work and show that consuming data through Apache Arrow is zero-cost: our novel data interface is either on par with or more performant than native Spark.

Comments:	6 pages, 6 figures
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2106.13020 [cs.DC]
	(or arXiv:2106.13020v1 [cs.DC] for this version)

Submission history

From: Sebastiaan Alvarez Rodriguez [view email]
[v1] Thu, 24 Jun 2021 13:52:08 GMT (558kb,D)
[v2] Sat, 27 Nov 2021 09:40:29 GMT (630kb,D)
