Title: Asynchronous Execution of Heterogeneous Tasks in ML-driven HPC Workflows

Abstract: Heterogeneous scientific workflows consist of many types of tasks that must execute on heterogeneous resources. Asynchronous execution of those tasks is crucial to improve resource utilization and task throughput and to reduce workflow makespan. Therefore, middleware capable of scheduling and executing different task types across heterogeneous resources must enable asynchronous execution of tasks. In this paper, we investigate the requirements and properties of asynchronous task execution in machine learning (ML)-driven high-performance computing (HPC) workflows. We model the degree of asynchronicity permitted for arbitrary workflows and propose key metrics that can be used to determine qualitative benefits when employing asynchronous execution. Our experiments represent relevant scientific drivers and are performed at scale on Summit; they show that the performance enhancements due to asynchronous execution are consistent with our model.
Comments: Published at the 26th edition of the Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2023)
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2208.11069 [cs.DC]
  (or arXiv:2208.11069v2 [cs.DC] for this version)

Submission history

From: Ozgur Ozan Kilic Ph.D.
[v1] Tue, 23 Aug 2022 16:25:48 GMT (3918kb,D)
[v2] Tue, 27 Jun 2023 16:13:22 GMT (4121kb,D)
