We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

cs.DC

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo

Computer Science > Distributed, Parallel, and Cluster Computing

Title: DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling

Abstract: For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high performance computing (HPC) systems. However, the increasingly complex HPC systems combined with highly diverse workloads make such manual process challenging, time-consuming, and error-prone. We present a reinforcement learning based HPC scheduling framework named DRAS-CQSim to automatically learn optimal scheduling policy. DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, which allows the system administrators to quickly obtain customized scheduling policies.
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Journal reference: Software Impacts 2021
Cite as: arXiv:2105.07526 [cs.DC]
  (or arXiv:2105.07526v1 [cs.DC] for this version)

Submission history

From: Yuping Fan [view email]
[v1] Sun, 16 May 2021 21:56:31 GMT (18843kb,D)

Link back to: arXiv, form interface, contact.