Computer Science > Computer Vision and Pattern Recognition
Title: Who are you referring to? Weakly supervised coreference resolution with multimodal grounding
(Submitted on 26 Nov 2022 (this version), latest version 17 Mar 2023 (v2))
Abstract: Coreference resolution aims to identify words and phrases that refer to the same entity in a text, and is a core tool in natural language processing. In this paper, we propose a novel task: resolving coreferences in multimodal data, specifically long-form textual descriptions of visual scenes. Most existing image-text datasets contain only short sentences without coreferent expressions, or do not annotate coreferences. To address this, we first introduce a new dataset, Flickr30k-Coref, in which coreference chains and the bounding box localization of these chains are annotated. We then propose a new technique that learns to identify coreference chains through weakly supervised grounding from image-text pairs and a regularization using prior linguistic knowledge. Our model yields large performance gains over prior work in coreference resolution and weakly supervised grounding of long-form text descriptions.
Submission history
From: Arushi Goel
[v1] Sat, 26 Nov 2022 13:33:42 GMT (36741kb,D)
[v2] Fri, 17 Mar 2023 15:12:13 GMT (39605kb,D)