Computer Science > Computation and Language
Title: Probing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders
(Submitted on 30 Apr 2022 (v1), last revised 13 Oct 2022 (this version, v2))
Abstract: Pretrained multilingual language models (LMs) can be successfully transformed into multilingual sentence encoders (SEs; e.g., LaBSE, xMPNet) via additional fine-tuning or model distillation with parallel data. However, it remains unclear how to best leverage them to represent sub-sentence lexical items (i.e., words and phrases) in cross-lingual lexical tasks. In this work, we probe SEs for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs. We also devise a simple yet efficient method for exposing the cross-lingual lexical knowledge by means of additional fine-tuning through inexpensive contrastive learning that requires only a small number of word translation pairs. Using bilingual lexicon induction (BLI), cross-lingual lexical semantic similarity, and cross-lingual entity linking as lexical probing tasks, we report substantial gains on standard benchmarks (e.g., +10 Precision@1 points in BLI). The results indicate that SEs such as LaBSE can be 'rewired' into effective cross-lingual lexical encoders via the contrastive learning procedure, and that they contain more cross-lingual lexical knowledge than what 'meets the eye' when they are used as off-the-shelf SEs. This way, we also provide an effective tool for harnessing 'covert' multilingual lexical knowledge hidden in multilingual sentence encoders.
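The abstract reports BLI results as Precision@1. As a rough illustration of that metric only (not the authors' evaluation code), the sketch below retrieves, for each source word, the target word with the highest cosine similarity and counts how often it is a gold translation. The function names `cosine` and `precision_at_1`, the toy 2-d vectors, and the tiny English-German dictionary are all assumptions invented for this example; real evaluations would use encoder outputs and a standard BLI dictionary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def precision_at_1(src_vecs, tgt_vecs, gold):
    """Fraction of source words whose nearest target word is a gold translation.

    src_vecs, tgt_vecs: dict mapping word -> vector (hypothetical toy embeddings).
    gold: dict mapping source word -> set of acceptable target words.
    """
    hits = 0
    for src_word, translations in gold.items():
        # Nearest-neighbour retrieval over the whole target vocabulary.
        best = max(tgt_vecs, key=lambda t: cosine(src_vecs[src_word], tgt_vecs[t]))
        hits += best in translations
    return hits / len(gold)

# Toy vectors, invented for illustration; "house" is deliberately misplaced
# so that its nearest neighbour is not the gold translation.
src = {"dog": [1.0, 0.1], "cat": [0.1, 1.0], "house": [0.8, 0.6]}
tgt = {"Hund": [1.0, 0.1], "Katze": [0.1, 1.0], "Haus": [1.0, 0.0]}
gold = {"dog": {"Hund"}, "cat": {"Katze"}, "house": {"Haus"}}

score = precision_at_1(src, tgt, gold)
print(f"Precision@1 = {score:.3f}")
```

In this toy setup two of the three source words retrieve their gold translation. A gain of "+10 Precision@1 points" in the abstract's sense would correspond to this fraction, expressed as a percentage, rising by 10.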
Submission history
From: Ivan Vulić
[v1] Sat, 30 Apr 2022 13:23:16 GMT (383kb,D)
[v2] Thu, 13 Oct 2022 11:58:27 GMT (391kb,D)