A Probabilistic Interpretation of Transformers

Shim, Alexander

Authors: Alexander Shim

(Submitted on 28 Apr 2022)

Abstract: We propose a probabilistic interpretation of the exponential dot-product attention in transformers, and of contrastive learning, based on exponential families. The attention sublayer of a transformer is equivalent to a gradient ascent step on the log normalizer, which is the log-sum-exp term in the Hopfield theory of attention. This ascent step induces a parallel expansion of points, which is counterbalanced by a contraction from layer normalization. We also state theoretical limitations of our theory and of the Hopfield theory, and suggest directions for resolution.
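
The central claim of the abstract can be checked numerically in a simplified setting. The sketch below is not taken from the paper: it assumes values tied to keys, no learned projections, no temperature scaling, and a step size of 1, and the names `log_normalizer`, `attention_output`, and `numerical_grad` are hypothetical. Under those assumptions, the gradient of the log-sum-exp of the scores with respect to the query equals the softmax-weighted average of the keys, so adding the attention output to the query is a gradient ascent step on the log normalizer.

```python
import numpy as np

def log_normalizer(q, K):
    """Log-sum-exp of the dot products between query q and the rows of K."""
    s = K @ q
    m = s.max()                      # subtract the max for numerical stability
    return m + np.log(np.exp(s - m).sum())

def attention_output(q, K):
    """Softmax-weighted average of the keys (values tied to keys: an assumption)."""
    s = K @ q
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ K

def numerical_grad(f, q, eps=1e-6):
    """Central-difference gradient of a scalar function f at q."""
    g = np.zeros_like(q)
    for i in range(q.size):
        d = np.zeros_like(q)
        d[i] = eps
        g[i] = (f(q + d) - f(q - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
K = rng.normal(size=(5, 3))          # 5 keys of dimension 3
q = rng.normal(size=3)               # one query

grad = numerical_grad(lambda x: log_normalizer(x, K), q)
attn = attention_output(q, K)
max_gap = float(np.abs(grad - attn).max())

# Residual update read as one gradient ascent step with step size 1 (assumed).
q_next = q + attn
```

Because log-sum-exp is convex, a step along its gradient cannot decrease it, so `log_normalizer(q_next, K)` is at least `log_normalizer(q, K)` in this setting. The sketch does not model layer normalization, which the abstract describes as the counterbalancing contraction.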

Comments:	Accepted at the ICML 2021 Workshop: Self-Supervised Learning for Reasoning and Perception
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2205.01080 [cs.LG]
	(or arXiv:2205.01080v1 [cs.LG] for this version)

Submission history

From: Alexander Shim
[v1] Thu, 28 Apr 2022 23:05:02 GMT (37kb)
