We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

cs.DC

Change to browse by:

cs

References & Citations

DBLP - CS Bibliography

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo

Computer Science > Distributed, Parallel, and Cluster Computing

Title: Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters

Abstract: Scientific workflow management systems like Nextflow support large-scale data analysis by abstracting away the details of scientific workflows. In these systems, workflows consist of several abstract tasks, of which instances are run in parallel and transform input partitions into output partitions. Resource managers like Kubernetes execute such workflow tasks on cluster infrastructures. However, these resource managers only consider the number of CPUs and the amount of available memory when assigning tasks to resources; they do not consider hardware differences beyond these numbers, while computational speed and memory access rates can differ significantly.
We propose Tarema, a system for allocating task instances to heterogeneous cluster resources during the execution of scalable scientific workflows. First, Tarema profiles the available infrastructure with a set of benchmark programs and groups cluster nodes with similar performance. Second, Tarema uses online monitoring data of tasks, assigning labels to tasks depending on their resource usage. Third, Tarema uses the node groups and task labels to dynamically assign task instances evenly to resources based on resource demand. Our evaluation of a prototype implementation for Kubernetes, using five real-world Nextflow workflows from the popular nf-core framework and two 15-node clusters consisting of different virtual machines, shows a mean reduction of isolated job runtimes by 19.8% compared to popular schedulers in widely-used resource managers and 4.54% compared to the heuristic SJFN, while providing a better cluster usage. Moreover, executing two long-running workflows in parallel and on restricted resources shows that Tarema is able to reduce the runtimes even more while providing a fair cluster usage.
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Journal reference: IEEE Big Data (2021), 65-75
DOI: 10.1109/BigData52589.2021.9671519
Cite as: arXiv:2111.05167 [cs.DC]
  (or arXiv:2111.05167v2 [cs.DC] for this version)

Submission history

From: Jonathan Bader [view email]
[v1] Tue, 9 Nov 2021 14:26:53 GMT (1217kb,D)
[v2] Wed, 19 Jan 2022 10:40:53 GMT (315kb,D)

Link back to: arXiv, form interface, contact.