Noisy Parallel Data Alignment

Xie, Ruoyu; Anastasopoulos, Antonios

(Submitted on 23 Jan 2023 (v1), last revised 10 Feb 2023 (this version, v2))

Abstract: An ongoing challenge in current natural language processing is that its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable for processing endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, reduces the alignment error rate of a state-of-the-art neural alignment model by up to 59.6%.

Comments:	EACL 2023 camera-ready version
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Journal reference:	Findings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023)
Cite as:	arXiv:2301.09685 [cs.CL]
	(or arXiv:2301.09685v2 [cs.CL] for this version)

Submission history

From: Ruoyu Xie
[v1] Mon, 23 Jan 2023 19:26:34 GMT (6695kb,D)
[v2] Fri, 10 Feb 2023 17:21:23 GMT (6694kb,D)
