LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Xu, Yang; Xu, Yiheng; Lv, Tengchao; Cui, Lei; Wei, Furu; Wang, Guoxin; Lu, Yijuan; Florencio, Dinei; Zhang, Cha; Che, Wanxiang; Zhang, Min; Zhou, Lidong

(Submitted on 29 Dec 2020 (v1), last revised 10 Jan 2022 (this version, v4))

Abstract: Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks, owing to effective model architectures and the availability of large-scale unlabeled scanned and digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture cross-modality interaction in the pre-training stage. It also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can capture the relative positional relationships among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 $\to$ 0.8420), CORD (0.9493 $\to$ 0.9601), SROIE (0.9524 $\to$ 0.9781), Kleister-NDA (0.8340 $\to$ 0.8520), RVL-CDIP (0.9443 $\to$ 0.9564), and DocVQA (0.7295 $\to$ 0.8672). We made our model and code publicly available at this https URL.
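To make the "large margin" concrete, the before and after scores listed in the abstract can be turned into absolute and relative gains. The following is a minimal sketch, not part of the paper: the `scores` dictionary simply copies the six number pairs above, and the helper name `gains` and the choice to report gains in percentage points are illustrative assumptions. The scores use different metrics across tasks, so the gains are only comparable within a task.

```python
# Score pairs copied from the abstract: (earlier score, LayoutLMv2 score).
# The metric differs by task, so compare gains within a task only.
scores = {
    "FUNSD": (0.7895, 0.8420),
    "CORD": (0.9493, 0.9601),
    "SROIE": (0.9524, 0.9781),
    "Kleister-NDA": (0.8340, 0.8520),
    "RVL-CDIP": (0.9443, 0.9564),
    "DocVQA": (0.7295, 0.8672),
}


def gains(before, after):
    """Hypothetical helper: return (absolute gain in points, relative gain in %)."""
    absolute = (after - before) * 100           # percentage points
    relative = (after - before) / before * 100  # percent of the earlier score
    return round(absolute, 2), round(relative, 1)


# Map each task to its (absolute, relative) gain.
summary = {task: gains(before, after) for task, (before, after) in scores.items()}

for task, (absolute, relative) in summary.items():
    print(f"{task:<13} +{absolute:.2f} points ({relative:.1f}% relative)")

# Task with the largest absolute gain.
largest = max(summary, key=lambda task: summary[task][0])
print("Largest absolute gain:", largest)
```

On these numbers the largest absolute gain is on DocVQA (13.77 points) and the smallest is on CORD (1.08 points).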

Comments:	ACL 2021 main conference
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2012.14740 [cs.CL]
	(or arXiv:2012.14740v4 [cs.CL] for this version)

Submission history

From: Lei Cui
[v1] Tue, 29 Dec 2020 13:01:52 GMT (5925kb,D)
[v2] Thu, 6 May 2021 07:02:57 GMT (5925kb,D)
[v3] Tue, 11 May 2021 06:42:33 GMT (5925kb,D)
[v4] Mon, 10 Jan 2022 04:08:10 GMT (5845kb,D)
