HistNERo: Historical Named Entity Recognition for the Romanian Language

Avram, Andrei-Marius; Iuga, Andreea; Manolache, George-Vlad; Matei, Vlad-Cristian; Micliuş, Răzvan-Gabriel; Muntean, Vlad-Andrei; Sorlescu, Manuel-Petru; Şerban, Dragoş-Andrei; Urse, Adrian-Dinu; Păiş, Vasile; Cercel, Dumitru-Clementin

(Submitted on 30 Apr 2024)

Abstract: This work introduces HistNERo, the first Romanian corpus for Named Entity Recognition (NER) in historical newspapers. The dataset contains 323k tokens of text, spanning from the early 19th century (1817) to the late 20th century (1990). Eight native Romanian speakers annotated the dataset with five named entity types. Each sample belongs to one of four historical regions of Romania: Bessarabia, Moldavia, Transylvania, or Wallachia. We used this dataset to run several NER experiments with Romanian pre-trained language models. The best model achieved a strict F1-score of 55.69%. By reducing the discrepancies between regions through a novel domain adaptation technique, we improved performance on this corpus to a strict F1-score of 66.80%, an absolute gain of more than 10 percentage points.
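
For readers unfamiliar with the metric, the following is a minimal sketch of how a strict F1-score is commonly computed for NER, assuming the usual convention that a predicted entity counts as correct only when both its span boundaries and its label match the gold annotation exactly. The function name, the tuple layout, and the labels "PER", "LOC", and "ORG" are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ. The last lines reproduce the arithmetic behind the reported absolute gain.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start_token, end_token, label).
Entity = Tuple[int, int, str]


def strict_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """F1 where a prediction counts only on an exact span-and-label match."""
    true_pos = len(gold & pred)  # exact tuple equality enforces strict matching
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): the second prediction overshoots the gold span by one
# token, so under strict matching it is both a false positive and a miss.
gold = {(0, 1, "PER"), (5, 6, "LOC"), (9, 9, "ORG")}
pred = {(0, 1, "PER"), (5, 7, "LOC"), (9, 9, "ORG")}
score = strict_f1(gold, pred)

# Absolute gain reported in the abstract, in percentage points.
gain = round(66.80 - 55.69, 2)
```

Under these assumptions a near-miss on a span boundary earns no credit, which is one reason strict scores on noisy historical text tend to be lower than relaxed (overlap-based) scores.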

Comments:	Accepted at the International Conference on Document Analysis and Recognition (ICDAR 2024)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2405.00155 [cs.CL]
	(or arXiv:2405.00155v1 [cs.CL] for this version)

Submission history

From: Andrei-Marius Avram
[v1] Tue, 30 Apr 2024 19:05:22 GMT (920kb,D)
