Title: Masked Vision-Language Transformer in Fashion

Abstract: We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the BERT in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet) and implicitly models the vision-language alignments. More importantly, MVLT generalizes easily to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. Code is made available at this https URL
Comments: Accepted by Machine Intelligence Research (2023)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Journal reference: Machine Intelligence Research. 20, 421-434 (2023)
DOI: 10.1007/s11633-022-1394-4
Cite as: arXiv:2210.15110 [cs.CV]
  (or arXiv:2210.15110v1 [cs.CV] for this version)

Submission history

From: Ge-Peng Ji
[v1] Thu, 27 Oct 2022 01:44:08 GMT (947kb,D)
