We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:


Current browse context:


Change to browse by:

References & Citations

DBLP - CS Bibliography


(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo ScienceWISE logo

Computer Science > Computation and Language

Title: Generalizing Multimodal Pre-training into Multilingual via Language Acquisition

Abstract: English-based Vision-Language Pre-training (VLP) has achieved great success in various downstream tasks. Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training (M-VLP). However, due to the large number of languages, M-VLP models often require huge computing resources and cannot be flexibly extended to new languages. In this work, we propose a \textbf{M}ulti\textbf{L}ingual \textbf{A}cquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual. Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models. We further propose a two-stage training strategy to optimize the language acquisition encoder, namely the Native Language Transfer stage and the Language Exposure stage. With much less multilingual training data and computing resources, our model achieves state-of-the-art performance on multilingual image-text and video-text retrieval benchmarks.
Comments: 14 pages, 5 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2206.11091 [cs.CL]
  (or arXiv:2206.11091v1 [cs.CL] for this version)

Submission history

From: Liang Zhang [view email]
[v1] Sun, 29 May 2022 08:53:22 GMT (1099kb,D)

Link back to: arXiv, form interface, contact.