Computer Science > Computer Vision and Pattern Recognition
Title: DocFormer: End-to-End Transformer for Document Understanding
(Submitted on 22 Jun 2021 (v1), last revised 20 Sep 2021 (this version, v2))
Abstract: We present DocFormer -- a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision, and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets, each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in number of parameters).
Submission history
From: Srikar Appalaraju
[v1] Tue, 22 Jun 2021 04:28:07 GMT (28294kb,D)
[v2] Mon, 20 Sep 2021 06:12:57 GMT (28305kb,D)