VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

Liu, Yuliang; Huang, Mingxin; Yan, Hao; Deng, Linger; Wu, Weijia; Lu, Hao; Shen, Chunhua; Jin, Lianwen; Bai, Xiang

Full-text links:

Download:

Current browse context:

cs.CV

< prev | next >

new | recent | 2404

Computer Science > Computer Vision and Pattern Recognition

Title: VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

Authors: Yuliang Liu, Mingxin Huang, Hao Yan, Linger Deng, Weijia Wu, Hao Lu, Chunhua Shen, Lianwen Jin, Xiang Bai

(Submitted on 30 Apr 2024 (v1), last revised 14 May 2024 (this version, v3))

Abstract: Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaption, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Typically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% in six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaption, our method even surpasses the previous end-to-end video spotting method in ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in generating cross-domain scene text spotting, in contrast to our VimTS model which requires significantly fewer parameters and data. The code and datasets will be made available at the this https URL

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2404.19652 [cs.CV]
	(or arXiv:2404.19652v3 [cs.CV] for this version)

Submission history

From: Yuliang Liu [view email]
[v1] Tue, 30 Apr 2024 15:49:03 GMT (2473kb,D)
[v2] Sun, 5 May 2024 01:26:55 GMT (2490kb,D)
[v3] Tue, 14 May 2024 15:07:05 GMT (2448kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:2404.19652

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Computer Vision and Pattern Recognition

Title: VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

Submission history