Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval

Ge, Xuri; Chen, Fuhai; Jose, Joemon M.; Ji, Zhilong; Wu, Zhongqin; Liu, Xiao

doi:10.1145/3474085.3475634

(Submitted on 5 Aug 2021)

Abstract: The current state-of-the-art image-sentence retrieval methods implicitly align visual-textual fragments, such as regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, retrieval performance remains unsatisfactory due to the lack of a consistent representation in both the semantic and structural spaces. In this work, we address this issue from two aspects: (i) constructing the intrinsic structure (along with relations) among the fragments of each modality, e.g., "dog $\to$ play $\to$ ball" in the semantic structure of an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. To this end, we propose a novel Structured Multi-modal Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. To jointly and explicitly learn the visual-textual embedding and the cross-modal alignment, SMFEA creates a novel multi-modal structured module with a shared context-aware referral tree. In particular, the relations among the visual and textual fragments are modeled by constructing a Visual Context-aware Structured Tree encoder (VCS-Tree) and a Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels, from which visual and textual features can be jointly learned and optimized. We use the multi-modal tree structure to explicitly align the heterogeneous image-sentence data by maximizing the semantic and structural similarity between corresponding inter-modal tree nodes. Extensive experiments on the Microsoft COCO and Flickr30K benchmarks demonstrate the superiority of the proposed model over state-of-the-art methods.

Comments:	9 pages, 7 figures, Accepted by ACM MM 2021
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
DOI:	10.1145/3474085.3475634
Cite as:	arXiv:2108.02417 [cs.CV]
	(or arXiv:2108.02417v1 [cs.CV] for this version)

Submission history

From: Xuri Ge
[v1] Thu, 5 Aug 2021 07:24:54 GMT (7036kb,D)