Donut: Document Understanding Transformer without OCR

Kim, Geewook; Hong, Teakgyu; Yim, Moonbin; Park, Jinyoung; Yim, Jinyeong; Hwang, Wonseok; Yun, Sangdoo; Han, Dongyoon; Park, Seunghyun

(Submitted on 30 Nov 2021 (this version), latest version 6 Oct 2022 (v5))

Abstract: Understanding document images (e.g., invoices) has been an important research topic and has many applications in document processing automation. Through the latest advances in deep learning-based Optical Character Recognition (OCR), current Visual Document Understanding (VDU) systems have come to be designed based on OCR. Although such OCR-based approaches promise reasonable performance, they suffer from critical problems induced by the OCR, e.g., (1) expensive computational costs and (2) performance degradation due to OCR error propagation. In this paper, we propose a novel VDU model that is end-to-end trainable without an underlying OCR framework. To this end, we propose a new task and a synthetic document image generator to pre-train the model, mitigating the dependency on large-scale real document images. Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets. Through extensive experiments and analysis, we demonstrate the effectiveness of the proposed model, especially with consideration for real-world applications.

Comments:	12 pages, 6 figures
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2111.15664 [cs.LG]
	(or arXiv:2111.15664v1 [cs.LG] for this version)

Submission history

From: Geewook Kim
[v1] Tue, 30 Nov 2021 18:55:19 GMT (4992kb,D)
[v2] Thu, 21 Jul 2022 16:10:17 GMT (5924kb,D)
[v3] Tue, 23 Aug 2022 10:30:19 GMT (5924kb,D)
[v4] Tue, 4 Oct 2022 13:34:02 GMT (5928kb,D)
[v5] Thu, 6 Oct 2022 06:50:39 GMT (5928kb,D)
