Computer Science > Distributed, Parallel, and Cluster Computing
Title: Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters
(Submitted on 3 Sep 2021 (v1), last revised 6 Sep 2021 (this version, v2))
Abstract: Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimizing resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of job features and user behaviors. We present a comprehensive study of the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We draw conclusions from the perspectives of clusters, jobs and users, which can inform cluster system design. Second, we introduce a general-purpose framework that manages resources based on historical data. As case studies, we design a Quasi-Shortest-Service-First scheduling service, which can reduce the cluster-wide average job completion time by up to 6.5x, and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.
Submission history
From: Qinghao Hu
[v1] Fri, 3 Sep 2021 05:02:52 GMT (620kb,D)
[v2] Mon, 6 Sep 2021 01:26:38 GMT (311kb,D)