DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling

Fan, Yuping; Lan, Zhiling

(Submitted on 16 May 2021)

Abstract: For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high performance computing (HPC) systems. However, increasingly complex HPC systems combined with highly diverse workloads make this manual process challenging, time-consuming, and error-prone. We present a reinforcement learning based HPC scheduling framework named DRAS-CQSim that automatically learns optimal scheduling policies. DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, allowing system administrators to quickly obtain customized scheduling policies.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Journal reference:	Software Impacts 2021
Cite as:	arXiv:2105.07526 [cs.DC]
	(or arXiv:2105.07526v1 [cs.DC] for this version)

Submission history

From: Yuping Fan
[v1] Sun, 16 May 2021 21:56:31 GMT (18843kb,D)
