Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders

Vulić, Ivan; Glavaš, Goran; Liu, Fangyu; Collier, Nigel; Ponti, Edoardo Maria; Korhonen, Anna

(Submitted on 30 Apr 2022 (this version), latest version 13 Oct 2022 (v2))

Abstract: Pretrained multilingual language models (LMs) can be successfully transformed into multilingual sentence encoders (SEs; e.g., LaBSE, xMPNET) via additional fine-tuning or model distillation on parallel data. However, it remains unclear how best to leverage their knowledge to represent sub-sentence lexical items (i.e., words and phrases) in cross-lingual lexical tasks. In this work, we probe these SEs for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs. We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models through an inexpensive contrastive learning procedure that requires only a small number of word translation pairs. We evaluate our method on bilingual lexicon induction (BLI), cross-lingual lexical semantic similarity, and cross-lingual entity linking, and report substantial gains on standard benchmarks (e.g., +10 Precision@1 points in BLI), validating that SEs such as LaBSE can be 'rewired' into effective cross-lingual lexical encoders. Moreover, we show that the resulting representations can be successfully interpolated with static embeddings from cross-lingual word embedding spaces to further boost performance in lexical tasks. In sum, our approach provides an effective tool for exposing and harnessing multilingual lexical knowledge 'hidden' in multilingual sentence encoders.
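
To make the two ideas in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes an in-batch softmax (InfoNCE-style) contrastive loss over word translation pairs and a simple linear interpolation of similarity scores; the abstract specifies neither the loss nor the interpolation scheme. All function names, the temperature, the weight `lam`, and the toy numbers are hypothetical and chosen only for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(src, tgt):
    """Cosine similarity between every source and every target vector."""
    src = [normalize(v) for v in src]
    tgt = [normalize(v) for v in tgt]
    return [[sum(a * b for a, b in zip(s, t)) for t in tgt] for s in src]

def contrastive_loss(src, tgt, temperature=0.05):
    """Assumed in-batch contrastive loss: src[i] and tgt[i] form a translation
    pair; all other targets in the batch act as negatives."""
    sims = cosine_matrix(src, tgt)
    total = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / len(sims)

def interpolate(sim_encoder, sim_static, lam=0.5):
    """Assumed scheme: weighted sum of encoder and static-embedding scores."""
    return [[lam * a + (1 - lam) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(sim_encoder, sim_static)]

def precision_at_1(sims):
    """Fraction of source words whose top-ranked target is the gold one
    (gold translations are assumed to lie on the diagonal)."""
    hits = sum(1 for i, row in enumerate(sims)
               if max(range(len(row)), key=row.__getitem__) == i)
    return hits / len(sims)

# Toy similarity scores (made up): the encoder alone ranks the second pair
# incorrectly, and mixing in static-embedding scores changes the ranking.
sim_encoder = [[0.9, 0.8], [0.7, 0.6]]
sim_static = [[0.5, 0.1], [0.1, 0.9]]
print(precision_at_1(sim_encoder))
print(precision_at_1(interpolate(sim_encoder, sim_static, lam=0.5)))
```

In an actual fine-tuning setup the loss would be minimized with respect to the encoder parameters using an automatic differentiation framework; the sketch only computes its value for fixed vectors.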

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2205.00267 [cs.CL]
	(or arXiv:2205.00267v1 [cs.CL] for this version)

Submission history

From: Ivan Vulić
[v1] Sat, 30 Apr 2022 13:23:16 GMT (383kb,D)
[v2] Thu, 13 Oct 2022 11:58:27 GMT (391kb,D)
