Computer Science > Computation and Language
Title: Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding
(Submitted on 1 Jun 2023 (v1), last revised 8 Jul 2023 (this version, v2))
Abstract: Fine-tuned transformer models have shown superior performance in many natural language tasks. However, their large size prohibits deploying high-performance transformer models on resource-constrained devices. This paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and ultimately runtime latency of transformer-based models. We compress the embedding and linear layers of transformers into small low-rank tensor cores, which significantly reduces the number of model parameters. Quantization-aware training with learnable scale factors is then used to obtain low-precision representations of the tensor-compressed models. The developed approach can be used for both end-to-end training and distillation-based training. To improve convergence, layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer. The performance is demonstrated on two natural language understanding tasks, showing up to a $63\times$ compression ratio, little accuracy loss, and remarkable inference and training speedup.
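The two ingredients named in the abstract, low-rank tensor cores and quantization with a scale factor, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a tensor-train-matrix factorization (the abstract only says "low-rank tensor cores"), and the layer shape, the ranks, and the helper names `dense_params`, `tt_matrix_params`, and `fake_quantize` are hypothetical choices made for illustration.

```python
from math import prod


def dense_params(in_dims, out_dims):
    """Parameter count of an ordinary dense weight matrix."""
    return prod(in_dims) * prod(out_dims)


def tt_matrix_params(in_dims, out_dims, ranks):
    """Parameter count of an assumed tensor-train-matrix factorization.

    Core k has shape (r_{k-1}, in_dims[k], out_dims[k], r_k),
    with boundary ranks r_0 = r_d = 1.
    """
    assert len(in_dims) == len(out_dims) == len(ranks) + 1
    full_ranks = [1, *ranks, 1]
    return sum(
        full_ranks[k] * in_dims[k] * out_dims[k] * full_ranks[k + 1]
        for k in range(len(in_dims))
    )


def fake_quantize(x, scale, bits=8):
    """Symmetric uniform quantization of one value.

    Rounds x / scale to a signed integer grid, clamps it to the
    representable range, and maps it back to a real value. In
    quantization-aware training the scale would be a trainable
    parameter; here it is a plain float (an assumption of this sketch).
    """
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale


# Hypothetical 768 x 768 linear layer, factorized as (12*8*8) x (8*8*12).
in_dims, out_dims, ranks = (12, 8, 8), (8, 8, 12), (30, 30)
n_dense = dense_params(in_dims, out_dims)
n_tt = tt_matrix_params(in_dims, out_dims, ranks)
ratio = n_dense / n_tt
```

The ratio computed here comes from parameter counting alone for one made-up layer and is unrelated to the $63\times$ figure reported in the abstract, which also reflects low-precision storage and the paper's own choice of shapes and ranks.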
Submission history
From: Zi Yang
[v1] Thu, 1 Jun 2023 18:32:08 GMT (595kb)
[v2] Sat, 8 Jul 2023 04:29:09 GMT (183kb)