Title: VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

Abstract: We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.
Comments: Paper accepted for publication at ACL 2022 Main; 28 pages, 4 figures, 11 tables
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
MSC classes: 68Txx
ACM classes: I.2.7; I.2.10
Cite as: arXiv:2112.07566 [cs.CL]
  (or arXiv:2112.07566v2 [cs.CL] for this version)

Submission history

From: Letitia Parcalabescu
[v1] Tue, 14 Dec 2021 17:15:04 GMT (16538kb,D)
[v2] Mon, 14 Mar 2022 15:08:08 GMT (16547kb,D)
