Masked Vision-Language Transformer in Fashion

Ji, Ge-Peng; Zhuge, Mingcheng; Gao, Dehong; Fan, Deng-Ping; Sakaridis, Christos; Van Gool, Luc

doi:10.1007/s11633-022-1394-4

(Submitted on 27 Oct 2022)

Abstract: We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace BERT in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that accepts raw multi-modal inputs without extra pre-processing models (e.g., ResNet) and implicitly models the vision-language alignments. More importantly, MVLT generalizes easily to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. The code is publicly available.
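
To make the idea of masked image reconstruction (MIR) more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy grayscale image, non-overlapping square patches, uniform random masking, and a mean-squared-error loss restricted to the masked patches; the abstract does not specify patch size, mask ratio, or loss, so all of these are illustrative assumptions, and the function names (patchify, random_mask, masked_mse) are hypothetical.

```python
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch tiles."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    tiles = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten each tile row by row into a single list of pixel values.
            tiles.append([image[top + r][left + c]
                          for r in range(patch) for c in range(patch)])
    return tiles


def random_mask(num_patches, ratio, rng):
    """Return sorted indices of the patches chosen for masking (assumed uniform sampling)."""
    k = int(round(num_patches * ratio))
    return sorted(rng.sample(range(num_patches), k))


def masked_mse(target, predicted, masked_idx):
    """Mean squared error computed over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for t, p in zip(target[i], predicted[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


# Toy 4 x 4 grayscale image with pixel values 0..15.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)  # 4 patches of 4 pixels each
masked = random_mask(len(patches), ratio=0.5, rng=random.Random(0))

# Stand-in for a model: predicts zeros for every patch.
prediction = [[0.0] * len(p) for p in patches]
loss = masked_mse(patches, prediction, masked)
```

In a real pre-training setup the zero prediction would be replaced by the output of the transformer, and the loss would be minimized so that the model learns to recover the hidden patches from the visible patches and the paired text.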

Comments:	Accepted by Machine Intelligence Research (2023)
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Journal reference:	Machine Intelligence Research. 20, 421-434 (2023)
DOI:	10.1007/s11633-022-1394-4
Cite as:	arXiv:2210.15110 [cs.CV]
	(or arXiv:2210.15110v1 [cs.CV] for this version)

Submission history

From: Ge-Peng Ji
[v1] Thu, 27 Oct 2022 01:44:08 GMT (947kb,D)
