Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks

Rugina, Ileana; Dangovski, Rumen; Jing, Li; Nakov, Preslav; Soljačić, Marin

(Submitted on 20 Nov 2020 (v1), last revised 8 May 2021 (this version, v2))

Abstract: The attention mechanism is a key component of the neural revolution in Natural Language Processing (NLP). As the size of attention-based models has been scaling with the available computational resources, a number of pruning techniques have been developed to detect and to exploit sparseness in such models in order to make them more efficient. The majority of such efforts have focused on looking for attention patterns and then hard-coding them to achieve sparseness, or pruning the weights of the attention mechanisms based on statistical information from the training data. Here, we marry these two lines of research by proposing Attention Pruning (AP): a novel pruning framework that collects observations about the attention patterns in a fixed dataset and then induces a global sparseness mask for the model. This can save 90% of the attention computation for language modelling and about 50% for machine translation and for solving GLUE tasks, while maintaining the quality of the results. Moreover, using our method, we discovered important distinctions between self- and cross-attention patterns, which could guide future NLP research in attention-based modelling. Our framework can in principle speed up any model that uses an attention mechanism, thus helping develop better models for existing or for new NLP applications. Our implementation is available at this https URL
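
The abstract describes AP only at a high level: collect attention patterns over a fixed dataset, then induce one global sparseness mask. The sketch below is one possible reading of that idea, not the authors' implementation. It assumes the observations are averaged over examples and that a fixed fraction of the largest average entries is kept; the function names, the `keep_fraction` parameter, and the random toy data are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_weights(q, k, mask=None):
    # Scaled dot-product attention weights for a single head.
    # q: (n_queries, d), k: (n_keys, d), mask: boolean (n_queries, n_keys).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores)


def induce_global_mask(attention_maps, keep_fraction):
    # ASSUMPTION: average the observed maps and keep the top `keep_fraction`
    # of entries. The abstract does not specify this rule.
    mean_attn = np.mean(attention_maps, axis=0)
    threshold = np.quantile(mean_attn, 1.0 - keep_fraction)
    mask = mean_attn >= threshold
    # Keep each query's strongest key so that no row is fully masked.
    rows = np.arange(mean_attn.shape[0])
    mask[rows, mean_attn.argmax(axis=1)] = True
    return mask


# Toy "fixed dataset": random queries and keys standing in for real inputs.
rng = np.random.default_rng(0)
seq_len, dim, n_examples = 8, 4, 32
queries = rng.normal(size=(n_examples, seq_len, dim))
keys = rng.normal(size=(n_examples, seq_len, dim))

# Step 1: collect attention patterns over the dataset.
maps = np.stack([attention_weights(q, k) for q, k in zip(queries, keys)])

# Step 2: induce one global mask shared by all inputs.
global_mask = induce_global_mask(maps, keep_fraction=0.1)

# Step 3: reuse the mask at inference; masked entries get zero weight.
pruned = attention_weights(queries[0], keys[0], mask=global_mask)
```

Because the mask is fixed once it has been induced, the masked positions need not be computed at all at inference time, which is where a saving in attention computation would come from. This dense NumPy version only illustrates the masking logic and does not itself skip any work.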

Comments:	13 pages, 6 figures, 10 tables
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2012.02030 [cs.CL]
	(or arXiv:2012.02030v2 [cs.CL] for this version)

Submission history

From: Rumen Dangovski [view email]
[v1] Fri, 20 Nov 2020 13:58:21 GMT (7993kb,D)
[v2] Sat, 8 May 2021 23:24:17 GMT (993kb,D)
