Cross-Modal Adapter for Text-Video Retrieval

Jiang, Haojun; Zhang, Jianke; Huang, Rui; Ge, Chunjiang; Ni, Zanlin; Lu, Jiwen; Zhou, Jie; Song, Shiji; Huang, Gao

(Submitted on 17 Nov 2022)

Abstract: Text-video retrieval is an important multi-modal learning task in which the goal is to retrieve the most relevant video for a given text query. Recently, pre-trained models such as CLIP have shown great potential on this task. However, as pre-trained models scale up, fully fine-tuning them on text-video retrieval datasets carries a high risk of overfitting. Moreover, in practice, it is costly to train and store a large model for each task. To overcome these issues, we present a novel Cross-Modal Adapter for parameter-efficient fine-tuning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Second, it allows early cross-modal interactions between CLIP's two encoders. Although surprisingly simple, our approach has three notable benefits: (1) it reduces the number of fine-tuned parameters by 99.6% and alleviates overfitting, (2) it saves approximately 30% of training time, and (3) it allows all pre-trained parameters to be fixed, so the pre-trained model can be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, it achieves superior or comparable performance compared to fully fine-tuned methods on the MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets. The code will be available at this https URL.
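For readers unfamiliar with the adapter-based methods the abstract refers to, the following is a minimal sketch of a generic bottleneck adapter (down-projection, ReLU, up-projection, residual connection) together with a parameter count. It is not the Cross-Modal Adapter itself: the abstract does not specify its layers or how the two encoders interact. The function names, the bias-free projections, and every dimension used below are assumptions chosen for illustration, not values from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: one inner list per output unit


def linear(x: Vector, weight: Matrix) -> Vector:
    """Bias-free linear map: one dot product per row of `weight`."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


def adapter_forward(x: Vector, w_down: Matrix, w_up: Matrix) -> Vector:
    """Generic bottleneck adapter: x + up(relu(down(x))).

    Only w_down and w_up would be trained; the backbone that
    produced x stays frozen.
    """
    hidden = [max(0.0, h) for h in linear(x, w_down)]
    delta = linear(hidden, w_up)
    return [xi + di for xi, di in zip(x, delta)]


def adapter_param_count(d_model: int, bottleneck: int) -> int:
    """Weights in the bias-free down- and up-projections."""
    return 2 * d_model * bottleneck


def trainable_fraction(backbone_params: int, d_model: int,
                       bottleneck: int, n_adapters: int) -> float:
    """Share of trainable adapter weights relative to the frozen backbone."""
    return n_adapters * adapter_param_count(d_model, bottleneck) / backbone_params


if __name__ == "__main__":
    # With a zero up-projection the adapter starts as the identity map,
    # so the frozen model's behaviour is unchanged before training.
    print(adapter_forward([1.0, 2.0], [[0.5, 0.5]], [[0.0], [0.0]]))

    # Hypothetical sizes (NOT taken from the paper): a 100M-parameter
    # backbone, width 512, bottleneck 64, 12 adapters.
    print(trainable_fraction(100_000_000, 512, 64, 12))
```

Because the trainable weights live only in the small projections, a single frozen backbone can in principle be paired with a separate, lightweight set of adapter weights per dataset, which is the storage benefit the abstract describes.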

Comments:	Tech Report
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2211.09623 [cs.CV]
	(or arXiv:2211.09623v1 [cs.CV] for this version)

Submission history

From: Haojun Jiang
[v1] Thu, 17 Nov 2022 16:15:30 GMT (1489kb,D)
