Title: OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

Abstract: We propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation that jointly models visual, text, and audio resources. OPT follows an encoder-decoder framework comprising three single-modal encoders that generate token-based embeddings for each modality, a cross-modal encoder that encodes the correlations among the three modalities, and two cross-modal decoders that generate text and images, respectively. To pre-train OPT, we design a multi-task pretext learning scheme that models multi-modal resources at three data granularities, i.e., token-, modality-, and sample-level modeling, through which OPT learns to align and translate among different modalities. Pre-training is carried out on a large number of image-text-audio triplets from Open Images. Experimental results show that OPT can learn strong image-text-audio multi-modal representations and achieve promising results on a variety of cross-modal understanding and generation tasks.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2107.00249 [cs.CV]
  (or arXiv:2107.00249v1 [cs.CV] for this version)

Submission history

From: Xinxin Zhu
[v1] Thu, 1 Jul 2021 06:59:44 GMT (1860kb,D)
[v2] Tue, 6 Jul 2021 03:18:27 GMT (1858kb,D)
