Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters

Bader, Jonathan; Thamsen, Lauritz; Kulagina, Svetlana; Will, Jonathan; Meyerhenke, Henning; Kao, Odej

doi:10.1109/BigData52589.2021.9671519

Authors: Jonathan Bader, Lauritz Thamsen, Svetlana Kulagina, Jonathan Will, Henning Meyerhenke, Odej Kao

(Submitted on 9 Nov 2021 (v1), last revised 19 Jan 2022 (this version, v2))

Abstract: Scientific workflow management systems like Nextflow support large-scale data analysis by abstracting away the details of scientific workflows. In these systems, workflows consist of several abstract tasks, whose instances run in parallel and transform input partitions into output partitions. Resource managers like Kubernetes execute such workflow tasks on cluster infrastructures. However, these resource managers consider only the number of CPUs and the amount of available memory when assigning tasks to resources; they do not consider hardware differences beyond these numbers, even though computational speed and memory access rates can differ significantly.
We propose Tarema, a system for allocating task instances to heterogeneous cluster resources during the execution of scalable scientific workflows. First, Tarema profiles the available infrastructure with a set of benchmark programs and groups cluster nodes with similar performance. Second, Tarema uses online monitoring data of tasks to assign labels to them according to their resource usage. Third, Tarema uses the node groups and task labels to dynamically assign task instances evenly to resources based on resource demand. Our evaluation of a prototype implementation for Kubernetes, using five real-world Nextflow workflows from the popular nf-core framework and two 15-node clusters consisting of different virtual machines, shows a mean reduction of isolated job runtimes of 19.8% compared to popular schedulers in widely used resource managers and of 4.54% compared to the heuristic SJFN, while providing better cluster usage. Moreover, executing two long-running workflows in parallel and on restricted resources shows that Tarema reduces runtimes even further while providing fair cluster usage.
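
The three steps described in the abstract (grouping nodes by benchmark results, labeling tasks by observed usage, and matching labels to groups while balancing load) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names Node, group_nodes, label_task and assign, the single cpu_score benchmark value, the equal-size grouping, and the linear mapping from mean usage to a label are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_score: float   # hypothetical benchmark result, higher means faster
    group: int = 0     # filled in by group_nodes (1 = slowest group)
    running: int = 0   # number of task instances currently placed here


def group_nodes(nodes, n_groups):
    """Step 1 (assumed): rank nodes by score and split them into equal-size groups."""
    ranked = sorted(nodes, key=lambda n: n.cpu_score)
    for i, node in enumerate(ranked):
        node.group = 1 + i * n_groups // len(ranked)


def label_task(cpu_usage_history, n_groups, max_usage=1.0):
    """Step 2 (assumed): map mean observed usage linearly onto labels 1..n_groups."""
    if not cpu_usage_history:
        return 1  # no monitoring data yet: assume the lowest demand
    mean = sum(cpu_usage_history) / len(cpu_usage_history)
    return min(1 + int(mean / max_usage * n_groups), n_groups)


def assign(label, nodes):
    """Step 3 (assumed): prefer the group closest to the label, then the least-loaded node."""
    best = min(nodes, key=lambda n: (abs(n.group - label), n.running, n.name))
    best.running += 1
    return best.name


nodes = [Node("a", 100), Node("b", 110), Node("c", 200), Node("d", 210)]
group_nodes(nodes, n_groups=2)
heavy = label_task([0.9, 0.8], n_groups=2)   # a demanding task
light = label_task([0.1], n_groups=2)        # a lightweight task
placements = [assign(heavy, nodes), assign(heavy, nodes), assign(light, nodes)]
print(placements)
```

In this toy setup, demanding tasks alternate between the two faster nodes while the lightweight task lands on a slower one, which mirrors the idea of spreading instances evenly within the group that fits their demand.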

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Journal reference:	IEEE Big Data (2021), 65-75
DOI:	10.1109/BigData52589.2021.9671519
Cite as:	arXiv:2111.05167 [cs.DC]
	(or arXiv:2111.05167v2 [cs.DC] for this version)

Submission history

From: Jonathan Bader
[v1] Tue, 9 Nov 2021 14:26:53 GMT (1217kb,D)
[v2] Wed, 19 Jan 2022 10:40:53 GMT (315kb,D)