Building Chinese Biomedical Language Models via Multi-Level Text Discrimination

Wang, Quan; Dai, Songtai; Xu, Benfeng; Lyu, Yajuan; Zhu, Yong; Wu, Hua; Wang, Haifeng


(Submitted on 14 Oct 2021 (this version), latest version 2 Mar 2022 (v2))

Abstract: Pre-trained language models (PLMs), such as BERT and GPT, have revolutionized the field of NLP, not only in the general domain but also in the biomedical domain. Most prior efforts in building biomedical PLMs have resorted simply to domain adaptation and have focused mainly on English. In this work we introduce eHealth, a Chinese biomedical PLM built with a new pre-training framework. This framework trains eHealth as a discriminator through both token-level and sequence-level discrimination. The former detects input tokens corrupted by a generator and selects their original signals from plausible candidates, while the latter further distinguishes corruptions of the same original sequence from those of other sequences. As such, eHealth can learn language semantics at both the token and sequence levels. Extensive experiments on 11 Chinese biomedical language understanding tasks of various forms verify the effectiveness and superiority of our approach. The pre-trained model is available to the public at this https URL and the code will also be released later.
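
The three discrimination signals named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss functions, so everything below is an assumption made for illustration: binary cross-entropy for detecting corrupted tokens, softmax cross-entropy for selecting the original token among candidates, and an InfoNCE-style contrastive loss with cosine similarity and a temperature for the sequence level. All function names and parameters are hypothetical and are not taken from the eHealth code.

```python
import math


def detection_loss(probs, labels):
    """Token-level detection (assumed form): mean binary cross-entropy.

    probs[i] is the predicted probability that token i was corrupted;
    labels[i] is 1 if it really was corrupted, else 0.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(probs)


def selection_loss(scores, gold_index):
    """Token-level selection (assumed form): softmax cross-entropy.

    scores holds one score per candidate token; gold_index points at
    the original token among those candidates.
    """
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold_index]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sequence_loss(anchor, positive, negatives, temperature=0.1):
    """Sequence-level discrimination (assumed form): InfoNCE-style loss.

    anchor and positive are embeddings of two corruptions of the same
    original sequence; negatives are embeddings of corruptions of
    other sequences. The temperature value is an arbitrary choice.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    return selection_loss(logits, 0)


if __name__ == "__main__":
    print(detection_loss([0.9, 0.1], [1, 0]))
    print(selection_loss([0.0, 0.0, 0.0, 0.0], 2))
    print(sequence_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

In this sketch the sequence-level loss reuses the softmax cross-entropy helper, since picking the matching corruption among negatives is a classification over similarity scores. How the three terms would be weighted and combined is not stated in the abstract and is left out here.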

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2110.07244 [cs.CL]
	(or arXiv:2110.07244v1 [cs.CL] for this version)

Submission history

From: Quan Wang
[v1] Thu, 14 Oct 2021 10:43:28 GMT (356kb,D)
[v2] Wed, 2 Mar 2022 10:04:24 GMT (473kb,D)
