Secure Distributed Training at Scale

Gorbunov, Eduard; Borzunov, Alexander; Diskin, Michael; Ryabinin, Max

(Submitted on 21 Jun 2021 (v1), revised 7 Oct 2021 (this version, v2), latest version 2 Jan 2023 (v4))

Abstract: Some of the hardest problems in deep learning can be solved by pooling the computational resources of many independent parties, as is the case for scientific collaborations and volunteer computing. Unfortunately, any single participant in such systems can jeopardize the entire training run by sending incorrect updates, whether deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or passing all updates through a trusted server. As a result, it can be infeasible to apply such algorithms to large-scale distributed deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency. We rigorously analyze this protocol: in particular, we provide theoretical bounds for its resistance against Byzantine and Sybil attacks and show that it has a marginal communication overhead. To demonstrate its practical effectiveness, we conduct large-scale experiments on image classification and language modeling in the presence of Byzantine attackers.
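
The claim that a single participant can derail training by sending an incorrect update can be illustrated with a toy aggregation step. The sketch below is not the protocol proposed in the paper: it uses coordinate-wise median, a generic robust aggregator chosen here only for illustration, and the function names and the sample updates are hypothetical.

```python
from statistics import median

def mean_aggregate(updates):
    """Plain averaging: every peer's update has equal weight."""
    n = len(updates)
    return [sum(col) / n for col in zip(*updates)]

def median_aggregate(updates):
    """Coordinate-wise median: a generic robust aggregator
    (an illustrative stand-in, not the paper's protocol)."""
    return [median(col) for col in zip(*updates)]

# Four honest peers report similar 2-dimensional updates (made-up values).
honest = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1], [1.0, -2.0]]
# One faulty or malicious peer reports an arbitrarily large update.
byzantine = [[1e6, 1e6]]
updates = honest + byzantine

# The mean is dragged far away from the honest updates by a single peer.
print(mean_aggregate(updates))
# The median stays at the honest values: [1.0, -2.0]
print(median_aggregate(updates))
```

With one bad peer out of five, the average is dominated by the outlier, while the median remains at the value the honest majority agrees on. Robust aggregators of this kind are only one ingredient of Byzantine tolerance; the abstract's concern is achieving it without redundant communication or a trusted server.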

Comments:	62 pages, 8 figures. Code: this https URL. v2 has slightly more general assumptions, contains additional clarifications on them, extra experiments, improved discussion of the algorithms and the related work, and corrected typos
Subjects:	Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
Cite as:	arXiv:2106.11257 [cs.LG]
	(or arXiv:2106.11257v2 [cs.LG] for this version)

Submission history

From: Alexander Borzunov
[v1] Mon, 21 Jun 2021 17:00:42 GMT (1916kb,D)
[v2] Thu, 7 Oct 2021 15:31:02 GMT (2107kb,D)
[v3] Tue, 28 Jun 2022 15:58:24 GMT (2715kb,D)
[v4] Mon, 2 Jan 2023 03:24:04 GMT (2715kb,D)
