Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words

Zhou, Kaitlyn; Ethayarajh, Kawin; Card, Dallas; Jurafsky, Dan

(Submitted on 10 May 2022)

Abstract: Cosine similarity of contextual embeddings is used in many NLP tasks (e.g., QA, IR, MT) and metrics (e.g., BERTScore). Here, we uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgements, cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts, even after controlling for polysemy and other factors. We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words and provide a formal argument for the two-dimensional case.
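
For readers unfamiliar with the measure discussed in the abstract, the following is a minimal sketch of cosine similarity in plain Python. It is not the authors' code. The function name `cosine_similarity` and the toy two-dimensional vectors are assumptions made for illustration and are not BERT embeddings. The sketch only shows that cosine depends on the angle between two vectors and not on their lengths.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 2-D toy vectors (illustrative only, not real embeddings).
# Same direction, different lengths: cosine is close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: cosine is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

In practice the inputs would be contextual embeddings taken from a model such as BERT, one vector per word occurrence. That extraction step is omitted here.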

Comments:	Camera Ready for ACL 2022 (Main Conference)
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2205.05092 [cs.CL]
	(or arXiv:2205.05092v1 [cs.CL] for this version)

Submission history

From: Kaitlyn Zhou
[v1] Tue, 10 May 2022 18:00:06 GMT (2775kb,D)
