Computer Science > Computer Vision and Pattern Recognition
Title: VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization
(Submitted on 30 Apr 2024 (v1), last revised 14 May 2024 (this version, v3))
Abstract: Text spotting, a task involving the extraction of textual information from images or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% on six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in generating cross-domain scene text spotting, in contrast to our VimTS model, which requires significantly fewer parameters and data. The code and datasets will be made available at this https URL.
Submission history
From: Yuliang Liu
[v1] Tue, 30 Apr 2024 15:49:03 GMT (2473kb,D)
[v2] Sun, 5 May 2024 01:26:55 GMT (2490kb,D)
[v3] Tue, 14 May 2024 15:07:05 GMT (2448kb,D)