ShortcutFusion: From Tensorflow to FPGA-based accelerator with reuse-aware memory allocation for shortcut data

Nguyen, Duy Thanh; Je, Hyeonseung; Nguyen, Tuan Nghia; Ryu, Soojung; Lee, Kyujoong; Lee, Hyuk-Jae

doi:10.1109/TCSI.2022.3153288


(Submitted on 15 Jun 2021 (v1), last revised 13 Feb 2022 (this version, v4))

Abstract: Residual blocks are a very common component in recent state-of-the-art CNNs such as EfficientNet and EfficientDet. Shortcut data accounts for nearly 40% of feature-map accesses in ResNet152 [8]. Most previous DNN compilers and accelerators ignore shortcut data optimization. This paper presents ShortcutFusion, an optimization tool for FPGA-based accelerators with reuse-aware static memory allocation for shortcut data, which maximizes on-chip data reuse under given resource constraints. From TensorFlow DNN models, the proposed design generates instruction sets for groups of nodes, using an optimized data reuse scheme for each residual block. The accelerator design implemented on the Xilinx KCU1500 FPGA card is 2.8x faster and 9.9x more power efficient than an NVIDIA RTX 2080 Ti for a 256x256 input size. Compared to a baseline in which the weights, inputs, and outputs are accessed from off-chip memory exactly once per layer, ShortcutFusion reduces DRAM access by 47.8-84.8% for RetinaNet, Yolov3, ResNet152, and EfficientNet. Given a buffer size similar to that of ShortcutMining [8], which also mines shortcut data in hardware, the proposed work reduces off-chip feature-map access by 5.27x while accessing weights from off-chip memory exactly once.
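
To make the baseline comparison in the abstract concrete, the following is a minimal sketch of how off-chip traffic for one residual block might be counted. It is not the paper's allocation algorithm: the `Layer` class, both function names, the byte sizes, and the rule "the shortcut stays on chip if it fits in the buffer" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical per-layer sizes in bytes (illustrative only)."""
    name: str
    in_bytes: int      # input feature-map size
    out_bytes: int     # output feature-map size
    weight_bytes: int  # weight size


def baseline_traffic(layers, shortcut_bytes):
    """Assumed baseline: each layer moves its input, output, and weights
    across the off-chip interface once, and the element-wise add at the
    end of the block reads the shortcut tensor from DRAM again."""
    per_layer = sum(l.in_bytes + l.out_bytes + l.weight_bytes for l in layers)
    return per_layer + shortcut_bytes


def reuse_traffic(layers, shortcut_bytes, buffer_bytes):
    """Toy reuse model: if the shortcut tensor fits in the on-chip buffer,
    it is held there and the extra DRAM read is skipped."""
    per_layer = sum(l.in_bytes + l.out_bytes + l.weight_bytes for l in layers)
    if shortcut_bytes <= buffer_bytes:
        return per_layer
    return per_layer + shortcut_bytes


# Made-up sizes for a two-layer residual block.
block = [Layer("conv1", 100, 100, 10), Layer("conv2", 100, 100, 10)]
print(baseline_traffic(block, shortcut_bytes=100))
print(reuse_traffic(block, shortcut_bytes=100, buffer_bytes=128))
print(reuse_traffic(block, shortcut_bytes=100, buffer_bytes=64))
```

The sketch only captures the idea that holding shortcut data in on-chip memory removes a DRAM access; the reduction figures reported in the abstract come from the paper's own allocation scheme and hardware, not from this model.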

Comments:	Accepted for publication in IEEE Transactions on Circuits and Systems I: Regular Papers
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR)
DOI:	10.1109/TCSI.2022.3153288
Cite as:	arXiv:2106.08167 [cs.DC]
	(or arXiv:2106.08167v4 [cs.DC] for this version)

Submission history

From: Duy Thanh Nguyen
[v1] Tue, 15 Jun 2021 14:10:10 GMT (1831kb)
[v2] Wed, 16 Jun 2021 10:59:04 GMT (1829kb)
[v3] Sat, 25 Dec 2021 07:03:26 GMT (1832kb)
[v4] Sun, 13 Feb 2022 12:57:13 GMT (1900kb)
