We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

cs.LG

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo ScienceWISE logo

Computer Science > Machine Learning

Title: DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks

Abstract: Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible. It is challenging due to large memory capacity and bandwidth requirements on a single compute node and high communication volumes across multiple nodes. In this paper, we present DistGNN that optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters via an efficient shared memory implementation, communication reduction using a minimum vertex-cut graph partitioning algorithm and communication avoidance using a family of delayed-update algorithms. Our results on four common GNN benchmark datasets: Reddit, OGB-Products, OGB-Papers and Proteins, show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets, respectively, over baseline DGL implementations running on a single CPU socket
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:2104.06700 [cs.LG]
  (or arXiv:2104.06700v1 [cs.LG] for this version)

Submission history

From: Md Vasimuddin [view email]
[v1] Wed, 14 Apr 2021 08:46:35 GMT (2506kb,D)
[v2] Thu, 15 Apr 2021 05:37:53 GMT (5051kb,D)
[v3] Fri, 16 Apr 2021 15:04:55 GMT (5051kb,D)

Link back to: arXiv, form interface, contact.