Computer Science > Machine Learning
Title: ResIST: Layer-Wise Decomposition of ResNets for Distributed Training
(Submitted on 2 Jul 2021 (v1), last revised 14 Mar 2022 (this version, v2))
Abstract: We propose ResIST, a novel distributed training protocol for Residual Networks (ResNets). ResIST randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats until convergence. By construction, per iteration, ResIST communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, ResIST reduces the per-iteration communication, memory, and time requirements of ResNet training to only a fraction of the requirements of full-model training. In comparison to common protocols, like data-parallel training and data-parallel training with local SGD, ResIST yields a decrease in communication and compute requirements, while being competitive with respect to model performance.
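The round structure described in the abstract (random decomposition, independent local training, aggregation into the global model) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes each residual block is just a list of floats, that blocks are split disjointly across workers (the abstract does not say how blocks are assigned or whether any layers are shared), and that local training is a placeholder update rather than real SGD. All function names (`partition_blocks`, `local_train`, `aggregate`, `resist_round`) are hypothetical.

```python
import random

def partition_blocks(num_blocks, num_workers, rng):
    """Randomly assign each block index to exactly one worker (assumed disjoint)."""
    indices = list(range(num_blocks))
    rng.shuffle(indices)
    # Striding over the shuffled order yields near-equal, shallow sub-networks.
    return [sorted(indices[w::num_workers]) for w in range(num_workers)]

def local_train(sub_params, local_iters, lr=0.1):
    """Placeholder for local SGD: gradient step on 0.5 * x^2 for each parameter."""
    trained = {b: list(p) for b, p in sub_params.items()}
    for _ in range(local_iters):
        for b in trained:
            trained[b] = [x - lr * x for x in trained[b]]
    return trained

def aggregate(global_params, worker_results):
    """Average each block over the workers that trained it; keep the rest as-is."""
    new_params = [list(p) for p in global_params]
    for b in range(len(global_params)):
        copies = [r[b] for r in worker_results if b in r]
        if copies:
            new_params[b] = [sum(vals) / len(copies) for vals in zip(*copies)]
    return new_params

def resist_round(global_params, num_workers, local_iters, rng):
    """One synchronization round: decompose, train locally, aggregate."""
    assignment = partition_blocks(len(global_params), num_workers, rng)
    results = []
    for blocks in assignment:
        # Each worker receives only its own blocks, never the full model.
        sub = {b: global_params[b] for b in blocks}
        results.append(local_train(sub, local_iters))
    return aggregate(global_params, results), assignment

rng = random.Random(0)
params = [[1.0, 2.0] for _ in range(6)]  # six toy "blocks"
params, assignment = resist_round(params, num_workers=3, local_iters=2, rng=rng)
```

Calling `resist_round` repeatedly draws a fresh random assignment each time, mirroring the "new sub-ResNets are randomly generated" step. In this sketch each worker handles two of the six blocks per round, which is the source of the reduced per-worker communication and memory the abstract refers to.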
Submission history
From: Cameron R. Wolfe
[v1] Fri, 2 Jul 2021 10:48:50 GMT (1040kb,D)
[v2] Mon, 14 Mar 2022 14:21:25 GMT (1246kb,D)