Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models

Carrino, Casimiro Pio; Armengol-Estapé, Jordi; Bonet, Ona de Gibert; Gutiérrez-Fandiño, Asier; Gonzalez-Agirre, Aitor; Krallinger, Martin; Villegas, Marta

(Submitted on 16 Sep 2021)

Abstract: We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a massive crawl of 3000 Spanish domains executed in 2020. The corpus is openly available and already preprocessed. CoWeSe is an important resource for biomedical and health NLP in Spanish and has already been employed to train domain-specific language models and to produce word embeddings. We released the CoWeSe corpus under a Creative Commons Attribution 4.0 International license on Zenodo (this https URL).

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2109.07765 [cs.CL]
	(or arXiv:2109.07765v1 [cs.CL] for this version)

Submission history

From: Casimiro Pio Carrino
[v1] Thu, 16 Sep 2021 07:22:28 GMT (21kb)
