Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing

Zhang, Boqiang; Xie, Hongtao; Gao, Zuan; Wang, Yuxin

(Submitted on 7 May 2024)

Abstract: Scene text images contain not only style information (font, background) but also content information (character, texture). Different scene text tasks need different information, but previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) that disentangles these two types of features so that each downstream task can use the ones it needs (choose what you really need). Specifically, we synthesize a dataset of image pairs with identical style but different content. Based on this dataset, we decouple the two types of features through the supervision design. Concretely, we directly split the visual representation into style and content features: the content features are supervised by a text recognition loss, while an alignment loss aligns the style features within each image pair. Then, the style features are used to reconstruct the counterpart image via an image decoder with a prompt that indicates the counterpart's content. This operation effectively decouples the features based on their distinctive properties. To the best of our knowledge, this is the first work in the field of scene text to disentangle the inherent properties of text images. Our method achieves state-of-the-art performance in Scene Text Recognition, Removal, and Editing.
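
The supervision described in the abstract (split the representation, supervise the content part with a recognition loss, align the style parts of a pair) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the slice-based split, the `style_dim` parameter, mean squared error as the alignment loss, cross-entropy as the recognition loss, and all function names. The prompt-conditioned image reconstruction step is not covered.

```python
import math


def split_features(features, style_dim):
    """Split one feature vector into (style, content) parts.

    Assumption: a plain slice; the abstract only says the representation
    is "directly split", not how.
    """
    if not 0 < style_dim < len(features):
        raise ValueError("style_dim must leave both parts non-empty")
    return features[:style_dim], features[style_dim:]


def alignment_loss(style_a, style_b):
    """Mean squared error between two style vectors (assumed loss form)."""
    return sum((a - b) ** 2 for a, b in zip(style_a, style_b)) / len(style_a)


def recognition_loss(char_logits, target_ids):
    """Mean per-character cross-entropy, a stand-in for the recognition loss."""
    total = 0.0
    for logits, target in zip(char_logits, target_ids):
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_norm - logits[target]
    return total / len(target_ids)


# Toy image pair: identical style part (first two values), different content part.
feat_a = [0.5, -1.0, 2.0, 0.1]
feat_b = [0.5, -1.0, -0.3, 1.7]
style_a, content_a = split_features(feat_a, style_dim=2)
style_b, content_b = split_features(feat_b, style_dim=2)

# The alignment term is zero when the two style parts already agree.
pair_alignment = alignment_loss(style_a, style_b)
```

In this toy setup the alignment term only touches the style slice and the recognition term would only touch logits derived from the content slice, which is the sense in which the supervision separates the two feature types.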

Comments:	Accepted to CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.04377 [cs.CV]
	(or arXiv:2405.04377v1 [cs.CV] for this version)

Submission history

From: Boqiang Zhang
[v1] Tue, 7 May 2024 15:00:11 GMT (3497kb,D)
