Computer Science > Computation and Language
Title: Differentiable Subset Pruning of Transformer Heads
(Submitted on 10 Aug 2021 (v1), last revised 27 Jul 2023 (this version, v3))
Abstract: Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably to or better than previous work while offering precise control of the sparsity level.
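The idea of per-head importance variables combined with a hard budget on the number of kept heads can be illustrated with a small sketch. The abstract does not say which relaxation is used, so the code below assumes one common choice: a soft top-k built from k successive softmaxes for training, and an exact top-k mask for inference. The function names, the temperature, and the example importance values are all hypothetical, and this is not presented as the authors' implementation.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def relaxed_top_k(importance, k, temperature=0.5):
    """Soft k-hot mask (assumed relaxation): sum of k successive softmaxes.

    After each round, heads that were already softly selected are
    penalized so that the next round favors a different head.
    """
    scores = list(importance)
    mask = [0.0] * len(scores)
    for _ in range(k):
        probs = softmax(scores, temperature)
        mask = [m + p for m, p in zip(mask, probs)]
        scores = [s + math.log(max(1.0 - p, 1e-12)) for s, p in zip(scores, probs)]
    return mask

def hard_top_k(importance, k):
    """Exact k-hot mask: keep exactly the k highest-scoring heads."""
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(importance))]

# Hypothetical importance values for six heads; in practice these would be
# trainable parameters updated by stochastic gradient descent.
head_importance = [2.0, -1.0, 0.5, 3.0, -0.5, 0.1]
soft_mask = relaxed_top_k(head_importance, k=2)
hard_mask = hard_top_k(head_importance, k=2)
```

In an automatic-differentiation framework, the soft mask would multiply each head's output during training so that gradients reach the importance values, while the hard mask guarantees that exactly k heads remain afterwards. The soft mask sums to k by construction, which is what gives direct control over the sparsity level in this sketch.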
Submission history
From: Jiaoda Li
[v1] Tue, 10 Aug 2021 13:08:34 GMT (185kb,D)
[v2] Sun, 22 Aug 2021 16:00:09 GMT (224kb,D)
[v3] Thu, 27 Jul 2023 07:14:18 GMT (224kb,D)