SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

Thangarasa, Vithursan; Gupta, Abhay; Marshall, William; Li, Tianda; Leong, Kevin; DeCoste, Dennis; Lie, Sean; Saxena, Shreyas

Full-text links:

Download:

Computer Science > Machine Learning

Title: SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

Authors: Vithursan Thangarasa, Abhay Gupta, William Marshall, Tianda Li, Kevin Leong, Dennis DeCoste, Sean Lie, Shreyas Saxena

(Submitted on 18 Mar 2023 (v1), last revised 29 Jul 2023 (this version, v2))

Abstract: The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization, etc.). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this also lead to highly prohibitive computational costs. Pre-training LLMs often require orders of magnitude more FLOPs than fine-tuning and the model capacity often remains the same between the two phases. To achieve training efficiency w.r.t training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity, while retaining the benefits of pre-trained textual representations for downstream tasks.

Comments:	Accepted to Uncertainty in Artificial Intelligence (UAI) 2023 Conference; 13 pages, 4 figures (Main Paper) + 5 pages (Supplementary Material)
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2303.10464 [cs.LG]
	(or arXiv:2303.10464v2 [cs.LG] for this version)

Submission history

From: Vithursan Thangarasa [view email]
[v1] Sat, 18 Mar 2023 17:56:01 GMT (316kb,D)
[v2] Sat, 29 Jul 2023 19:56:50 GMT (579kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:2303.10464

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Machine Learning

Title: SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

Submission history