Title: Rethinking Attention with Performers

Abstract: We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and to investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
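The idea summarized in the abstract, replacing the quadratic softmax attention matrix with products of positive random feature maps, can be illustrated with a small pure-Python sketch. This is not the authors' implementation: it assumes i.i.d. Gaussian projections rather than the orthogonal ones used by FAVOR+, covers only non-causal attention, and omits any numerical stabilization. The names `positive_features`, `linear_attention`, `softmax_attention`, `rand_matrix` and `omegas` are hypothetical, chosen for illustration only.

```python
import math
import random


def positive_features(x, omegas):
    """Map x to positive random features whose dot products estimate exp(q . k)."""
    half_sq_norm = sum(v * v for v in x) / 2.0
    scale = 1.0 / math.sqrt(len(omegas))
    return [scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - half_sq_norm)
            for w in omegas]


def linear_attention(queries, keys, values, omegas):
    """Approximate non-causal softmax attention without forming the L x L matrix."""
    d = len(queries[0])
    s = d ** -0.25  # splits the usual 1/sqrt(d) scaling between queries and keys
    q_feats = [positive_features([s * v for v in q], omegas) for q in queries]
    k_feats = [positive_features([s * v for v in k], omegas) for k in keys]
    m, d_v = len(omegas), len(values[0])
    # Key-side summaries, computed once: kv[j][c] = sum_i k_feats[i][j] * values[i][c]
    kv = [[0.0] * d_v for _ in range(m)]
    k_sum = [0.0] * m
    for kf, v in zip(k_feats, values):
        for j in range(m):
            k_sum[j] += kf[j]
            for c in range(d_v):
                kv[j][c] += kf[j] * v[c]
    out = []
    for qf in q_feats:
        # Row normalizer, the analogue of the softmax denominator.
        norm = sum(qf[j] * k_sum[j] for j in range(m))
        out.append([sum(qf[j] * kv[j][c] for j in range(m)) / norm for c in range(d_v)])
    return out


def softmax_attention(queries, keys, values):
    """Exact softmax attention, quadratic in sequence length, for comparison."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        weights = [math.exp(sc - top) for sc in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, values)) / total
                    for c in range(len(values[0]))])
    return out


def rand_matrix(rng, rows, cols, std):
    """Matrix of independent Gaussian entries with the given standard deviation."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
L, d, m = 6, 4, 2000  # sequence length, head dimension, number of random features
Q, K, V = rand_matrix(rng, L, d, 0.5), rand_matrix(rng, L, d, 0.5), rand_matrix(rng, L, d, 1.0)
omegas = rand_matrix(rng, m, d, 1.0)  # i.i.d. Gaussian here, an assumption of this sketch

approx = linear_attention(Q, K, V, omegas)
exact = softmax_attention(Q, K, V)
max_err = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
print("largest absolute deviation from exact attention:", max_err)
```

The estimate rests on the identity exp(q . k) = E[exp(w . q - |q|^2 / 2) exp(w . k - |k|^2 / 2)] for w drawn from a standard Gaussian. Because the key-side sums `kv` and `k_sum` are built once and reused for every query, the cost in this sketch grows linearly with sequence length for a fixed number of features `m`.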
Comments: Published as a conference paper + oral presentation at ICLR 2021. 38 pages. This is an updated version of a previous submission which can be found at arXiv:2006.03555. See this https URL for protein language model code, and this https URL for Performer code
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Cite as: arXiv:2009.14794 [cs.LG]
  (or arXiv:2009.14794v2 [cs.LG] for this version)

Submission history

From: Valerii Likhosherstov
[v1] Wed, 30 Sep 2020 17:09:09 GMT (10282kb,D)
[v2] Tue, 16 Feb 2021 21:40:24 GMT (13996kb,D)
[v3] Tue, 9 Mar 2021 16:26:47 GMT (13996kb,D)
[v4] Sat, 19 Nov 2022 12:45:21 GMT (27987kb,D)
