
Title: Donut: Document Understanding Transformer without OCR

Abstract: Understanding document images (e.g., invoices) has been an important research topic and has many applications in document processing automation. Through the latest advances in deep learning-based Optical Character Recognition (OCR), current Visual Document Understanding (VDU) systems have come to be designed based on OCR. Although such OCR-based approaches promise reasonable performance, they suffer from critical problems induced by the OCR, e.g., (1) expensive computational costs and (2) performance degradation due to OCR error propagation. In this paper, we propose a novel VDU model that is end-to-end trainable without an underlying OCR framework. To this end, we propose a new task and a synthetic document image generator to pre-train the model, mitigating the dependency on large-scale real document images. Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets. Through extensive experiments and analysis, we demonstrate the effectiveness of the proposed model, especially with consideration for a real-world application.
Comments: 12 pages, 6 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2111.15664 [cs.LG]
  (or arXiv:2111.15664v1 [cs.LG] for this version)

Submission history

From: Geewook Kim
[v1] Tue, 30 Nov 2021 18:55:19 GMT (4992kb,D)
[v2] Thu, 21 Jul 2022 16:10:17 GMT (5924kb,D)
[v3] Tue, 23 Aug 2022 10:30:19 GMT (5924kb,D)
[v4] Tue, 4 Oct 2022 13:34:02 GMT (5928kb,D)
[v5] Thu, 6 Oct 2022 06:50:39 GMT (5928kb,D)
