DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models

Chen, Xuxi; Chen, Tianlong; Chen, Weizhu; Awadallah, Ahmed Hassan; Wang, Zhangyang; Cheng, Yu

(Submitted on 30 Oct 2021 (v1), last revised 24 May 2023 (this version, v3))

Abstract: Gigantic pre-trained models have become central to natural language processing (NLP), serving as the starting point for fine-tuning towards a range of downstream tasks. However, two pain points persist for this paradigm: (a) as the pre-trained models grow bigger (e.g., 175B parameters for GPT-3), even the fine-tuning process can be time-consuming and computationally expensive; (b) the fine-tuned model has the same size as its starting point by default, which is neither sensible, given its more specialized functionality, nor practical, since many fine-tuned models will be deployed in resource-constrained environments. To address these pain points, we propose a framework for resource- and parameter-efficient fine-tuning that leverages the sparsity prior in both the weight updates and the final model weights. Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter-efficient fine-tuning, by enforcing sparsity-aware low-rank updates on top of the pre-trained weights; and (ii) resource-efficient inference, by encouraging a sparse weight structure in the final fine-tuned model. We leverage sparsity in these two directions by exploiting both unstructured and structured sparse patterns in pre-trained language models via a unified approach. Extensive experiments and in-depth investigations, with diverse network backbones (i.e., BERT, RoBERTa, and GPT-2) on dozens of datasets, consistently demonstrate impressive parameter and inference efficiency while maintaining competitive downstream performance. For instance, DSEE saves about 25% of inference FLOPs while achieving comparable performance, with 0.5% trainable parameters on BERT. Code is available at this https URL
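The two objectives in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the decomposition of the update into a low-rank product plus a sparse residual, the magnitude-based pruning mask, the choice to mask the merged weight, and all names (`magnitude_mask`, `u`, `v`, `s2`, `n_sparse`) are assumptions made for illustration only.

```python
import numpy as np


def magnitude_mask(w, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of
    smallest-magnitude entries of w (assumed pruning criterion)."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return np.ones_like(w)
    # k-th smallest absolute value acts as the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return (np.abs(w) > threshold).astype(w.dtype)


rng = np.random.default_rng(0)
d_out, d_in, rank, n_sparse = 8, 8, 2, 4  # toy sizes, hypothetical

# Frozen pre-trained weight (random stand-in).
w0 = rng.standard_normal((d_out, d_in))

# (i) Parameter-efficient update: low-rank factors plus a sparse residual.
u = rng.standard_normal((d_out, rank)) * 0.01
v = rng.standard_normal((rank, d_in)) * 0.01
support = rng.choice(w0.size, size=n_sparse, replace=False)
s2 = np.zeros_like(w0)
s2.flat[support] = rng.standard_normal(n_sparse) * 0.01

delta_w = u @ v + s2

# Only u, v and the nonzero entries of s2 would be trained.
trainable = u.size + v.size + n_sparse
trainable_fraction = trainable / w0.size

# (ii) Resource-efficient inference: keep only the largest-magnitude
# half of the merged weight (simplification assumed here).
w_merged = w0 + delta_w
mask = magnitude_mask(w_merged, sparsity=0.5)
w_sparse = mask * w_merged

print(f"trainable parameters: {trainable} of {w0.size}")
print(f"nonzero weights after pruning: {np.count_nonzero(w_sparse)}")
```

At this toy size the trainable fraction is large; the savings only become meaningful when the rank and the number of sparse entries are small relative to the layer dimensions, as in the backbones the abstract mentions.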

Comments:	Accepted by ACL 2023
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2111.00160 [cs.LG]
	(or arXiv:2111.00160v3 [cs.LG] for this version)

Submission history

From: Xuxi Chen
[v1] Sat, 30 Oct 2021 03:29:47 GMT (1066kb,D)
[v2] Sun, 31 Jul 2022 16:30:56 GMT (1116kb,D)
[v3] Wed, 24 May 2023 02:29:37 GMT (966kb,D)
