Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks

Parcalabescu, Letitia; Gatt, Albert; Frank, Anette; Calixto, Iacer

(Submitted on 22 Dec 2020 (v1), last revised 17 Jun 2021 (this version, v4))

Abstract: We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations on specific phenomena.

Comments:	Paper accepted for publication at MMSR 2021; 13 pages, 3 figures, 7 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
MSC classes:	68Txx
ACM classes:	I.2.7; I.2.10
Journal reference:	Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR), 2021, Groningen, Netherlands (Online), Association for Computational Linguistics, pp. 32-44
Cite as:	arXiv:2012.12352 [cs.CV]
	(or arXiv:2012.12352v4 [cs.CV] for this version)

Submission history

From: Letitia Parcalabescu
[v1] Tue, 22 Dec 2020 21:01:44 GMT (8178kb,D)
[v2] Fri, 7 May 2021 16:38:24 GMT (6305kb,D)
[v3] Mon, 24 May 2021 11:03:08 GMT (6305kb,D)
[v4] Thu, 17 Jun 2021 17:51:56 GMT (6305kb,D)
