Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training

Sun, Jie; Su, Li; Shi, Zuocheng; Shen, Wenting; Wang, Zeke; Wang, Lei; Zhang, Jie; Li, Yong; Yu, Wenyuan; Zhou, Jingren; Wu, Fei

(Submitted on 26 May 2023 (v1), last revised 12 Jun 2023 (this version, v2))

Abstract: Graph neural networks (GNNs) have been widely applied in real-world applications, such as product recommendation on e-commerce platforms and risk control in financial management systems. Several cache-based GNN systems have been built to accelerate GNN training on a single machine with multiple GPUs. However, these systems fail to train billion-scale graphs efficiently, which is a common challenge in industry. In this work, we propose Legion, a system that automatically pushes the envelope of multi-GPU systems for accelerating billion-scale GNN training. First, we design a hierarchical graph partitioning mechanism that significantly improves multi-GPU cache performance. Second, we build a unified multi-GPU cache that helps minimize PCIe traffic by caching both the graph topology and the features with the highest hotness. Third, we develop an automatic cache management mechanism that adapts the multi-GPU cache plan to the hardware specifications and the graph at hand to maximize overall training throughput. Evaluations on various GNN models and multiple datasets show that Legion supports training billion-scale GNNs on a single machine and significantly outperforms state-of-the-art cache-based systems on small graphs.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2305.16588 [cs.DC]
	(or arXiv:2305.16588v2 [cs.DC] for this version)

Submission history

From: Jie Sun
[v1] Fri, 26 May 2023 02:30:38 GMT (2737kb,D)
[v2] Mon, 12 Jun 2023 08:25:09 GMT (7825kb,D)
