Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

Cornia, Marcella; Baraldi, Lorenzo; Fiameni, Giuseppe; Cucchiara, Rita

Computer Science > Computer Vision and Pattern Recognition

(Submitted on 24 Nov 2021 (v1), last revised 30 Nov 2023 (this version, v3))

Abstract: This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need for object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.

Comments:	Accepted to IJCV
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Cite as:	arXiv:2111.12727 [cs.CV]
	(or arXiv:2111.12727v3 [cs.CV] for this version)

Submission history

From: Marcella Cornia
[v1] Wed, 24 Nov 2021 19:00:05 GMT (3868kb,D)
[v2] Tue, 29 Mar 2022 12:07:47 GMT (3960kb,D)
[v3] Thu, 30 Nov 2023 11:47:36 GMT (3943kb,D)
