EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference

Tambe, Thierry; Hooper, Coleman; Pentecost, Lillian; Jia, Tianyu; Yang, En-Yu; Donato, Marco; Sanh, Victor; Whatmough, Paul N.; Rush, Alexander M.; Brooks, David; Wei, Gu-Yeon

Computer Science > Hardware Architecture

(Submitted on 28 Nov 2020 (v1), last revised 6 Sep 2021 (this version, v5))

Abstract: Transformer-based language models such as BERT provide significant accuracy improvements for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy on resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP. EdgeBERT employs entropy-based early exit predication in order to perform dynamic voltage-frequency scaling (DVFS), at a sentence granularity, for minimal energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by employing a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, in order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system, integrating a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), as well as high-density embedded non-volatile memories (eNVMs) wherein the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system consumes up to 7x, 2.5x, and 53x less energy than the conventional inference without early stopping, the latency-unbounded early exit approach, and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
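
The two ideas the abstract combines, an entropy test for early exit and a per-sentence choice of clock frequency under a latency target, can be illustrated with a small sketch. This is not the EdgeBERT implementation: the function names, the entropy threshold, the fixed cycles-per-layer cost model, and the list of available frequencies are all hypothetical, and the abstract does not say how the exit layer is predicted ahead of time, so the sketch simply takes the number of layers to run as an input.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit_layer(per_layer_logits, threshold):
    """Return the 1-based index of the first layer whose classifier
    output entropy falls below `threshold` (an assumed hyperparameter).
    Falls back to the last layer if no layer is confident enough."""
    for i, logits in enumerate(per_layer_logits, start=1):
        if entropy(softmax(logits)) < threshold:
            return i
    return len(per_layer_logits)


def lowest_frequency_meeting_target(layers_to_run, cycles_per_layer,
                                    target_latency_s, freqs_hz):
    """Pick the slowest available clock that still finishes
    `layers_to_run` layers within the latency target. The linear
    cycles-per-layer cost model is an assumption for illustration."""
    total_cycles = layers_to_run * cycles_per_layer
    for f in sorted(freqs_hz):
        if total_cycles / f <= target_latency_s:
            return f
    # No setting meets the target: run as fast as the hardware allows.
    return max(freqs_hz)


# Hypothetical per-layer classifier logits for one sentence.
logits_by_layer = [[0.1, 0.0], [2.0, -2.0], [5.0, -5.0]]
exit_layer = early_exit_layer(logits_by_layer, threshold=0.2)

# Hypothetical operating points and cost model.
freq = lowest_frequency_meeting_target(
    layers_to_run=exit_layer,
    cycles_per_layer=1_000_000,
    target_latency_s=0.01,
    freqs_hz=[100e6, 250e6, 500e6, 1e9],
)
```

In this toy setup, a sentence that exits early needs fewer cycles, so a slower (and, on real hardware, typically lower-voltage) operating point can still meet the same latency target. That is the intuition behind scaling voltage and frequency per sentence rather than always running at the maximum setting.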

Comments:	12 pages plus references. Paper to appear at the 54th IEEE/ACM International Symposium on Microarchitecture (MICRO 2021)
Subjects:	Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Cite as:	arXiv:2011.14203 [cs.AR]
	(or arXiv:2011.14203v5 [cs.AR] for this version)

Submission history

From: Thierry Tambe
[v1] Sat, 28 Nov 2020 19:21:47 GMT (2048kb,D)
[v2] Tue, 1 Dec 2020 03:03:49 GMT (2048kb,D)
[v3] Mon, 15 Feb 2021 04:02:59 GMT (2113kb,D)
[v4] Sat, 17 Apr 2021 22:11:02 GMT (7982kb,D)
[v5] Mon, 6 Sep 2021 03:48:22 GMT (6981kb,D)
