Data-Efficient Information Extraction from Form-Like Documents

Gunel, Beliz; Potti, Navneet; Tata, Sandeep; Wendt, James B.; Najork, Marc; Xie, Jing

Full-text links:

Download:

Current browse context:

cs.LG

< prev | next >

new | recent | 2201

Computer Science > Machine Learning

Title: Data-Efficient Information Extraction from Form-Like Documents

Authors: Beliz Gunel, Navneet Potti, Sandeep Tata, James B. Wendt, Marc Najork, Jing Xie

(Submitted on 7 Jan 2022)

Abstract: Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data-efficiency, and (2) ability to generalize across different document types and languages.
In this paper, we show that when we have only a small number of labeled documents for training (~50), a straightforward transfer learning approach from a considerably structurally-different larger labeled corpus yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, that is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document-types, and learning good representations is critical to accomplishing this.

Comments:	Published at the 2nd Document Intelligence Workshop @ KDD 2021 (this https URL)
Subjects:	Machine Learning (cs.LG); Information Retrieval (cs.IR)
Cite as:	arXiv:2201.02647 [cs.LG]
	(or arXiv:2201.02647v1 [cs.LG] for this version)

Submission history

From: Marc Najork [view email]
[v1] Fri, 7 Jan 2022 19:16:49 GMT (1162kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:2201.02647

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Machine Learning

Title: Data-Efficient Information Extraction from Form-Like Documents

Submission history