Distributed Training Large-Scale Deep Architectures

Zou, Shang-Xuan; Chen, Chun-Yen; Wu, Jui-Lin; Chou, Chun-Nan; Tsao, Chia-Chin; Tung, Kuan-Chieh; Lin, Ting-Wei; Sung, Cheng-Lung; Chang, Edward Y.

(Submitted on 10 Aug 2017)

Abstract: Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing the system approach to speed up large-scale training. Via lessons learned from our routine benchmarking effort, we first identify bottlenecks and overheads that hinder data parallelism. We then devise guidelines that help practitioners configure an effective system and fine-tune parameters to achieve the desired speedup. Specifically, we develop a procedure for setting minibatch size and choosing computation algorithms. We also derive lemmas for determining the quantity of key components such as the number of GPUs and parameter servers. Experiments and examples show that these guidelines help effectively speed up large-scale deep learning training.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1709.06622 [cs.DC]
	(or arXiv:1709.06622v1 [cs.DC] for this version)

Submission history

From: Chun-Nan Chou
[v1] Thu, 10 Aug 2017 09:24:27 GMT (542kb,D)
