Probing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders

Vulić, Ivan; Glavaš, Goran; Liu, Fangyu; Collier, Nigel; Ponti, Edoardo Maria; Korhonen, Anna

(Submitted on 30 Apr 2022 (v1), last revised 13 Oct 2022 (this version, v2))

Abstract: Pretrained multilingual language models (LMs) can be successfully transformed into multilingual sentence encoders (SEs; e.g., LaBSE, xMPNet) via additional fine-tuning or model distillation with parallel data. However, it remains unclear how to best leverage them to represent sub-sentence lexical items (i.e., words and phrases) in cross-lingual lexical tasks. In this work, we probe SEs for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs. We also devise a simple yet efficient method for exposing the cross-lingual lexical knowledge by means of additional fine-tuning through inexpensive contrastive learning that requires only a small number of word translation pairs. Using bilingual lexicon induction (BLI), cross-lingual lexical semantic similarity, and cross-lingual entity linking as lexical probing tasks, we report substantial gains on standard benchmarks (e.g., +10 Precision@1 points in BLI). The results indicate that SEs such as LaBSE can be 'rewired' into effective cross-lingual lexical encoders via the contrastive learning procedure, and that they contain more cross-lingual lexical knowledge than what 'meets the eye' when they are used as off-the-shelf SEs. This way, we also provide an effective tool for harnessing 'covert' multilingual lexical knowledge hidden in multilingual sentence encoders.
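
The two quantitative ideas in the abstract, Precision@1 in BLI and contrastive learning on word translation pairs, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-dimensional vectors, the in-batch InfoNCE-style form of the loss, and the temperature value are all assumptions chosen for illustration, and a real setup would obtain word vectors from an encoder such as LaBSE.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def precision_at_1(src_vecs, tgt_vecs, gold):
    """Fraction of source words whose nearest target vector is the gold translation.

    gold[i] is the index in tgt_vecs of the correct translation of source word i.
    """
    hits = 0
    for i, u in enumerate(src_vecs):
        sims = [cosine(u, v) for v in tgt_vecs]
        best = max(range(len(tgt_vecs)), key=sims.__getitem__)
        hits += int(best == gold[i])
    return hits / len(src_vecs)


def in_batch_contrastive_loss(src_vecs, tgt_vecs, temperature=0.1):
    """InfoNCE-style loss (assumed form): pair (i, i) is the positive,
    all other target vectors in the batch act as negatives."""
    total = 0.0
    for i, u in enumerate(src_vecs):
        logits = [cosine(u, v) / temperature for v in tgt_vecs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(src_vecs)


# Hypothetical toy embeddings for three translation pairs.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
gold = [0, 1, 2]

p_at_1 = precision_at_1(src, tgt, gold)
loss = in_batch_contrastive_loss(src, tgt)
```

In this sketch, fine-tuning would lower the loss by pulling each word closer to its translation than to the other targets in the batch, which is also what raises Precision@1 under nearest-neighbour retrieval.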

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2205.00267 [cs.CL]
	(or arXiv:2205.00267v2 [cs.CL] for this version)

Submission history

From: Ivan Vulić
[v1] Sat, 30 Apr 2022 13:23:16 GMT (383kb,D)
[v2] Thu, 13 Oct 2022 11:58:27 GMT (391kb,D)
