Balancing the composition of word embeddings across heterogenous data sets

Brandl, Stephanie; Lassner, David; Alber, Maximilian


(Submitted on 14 Jan 2020)

Abstract: Word embeddings capture semantic relationships based on contextual information and are the basis for a wide variety of natural language processing applications. Notably, these relationships are learned solely from the data, so the composition of the data affects the semantics of the embeddings, which arguably can lead to biased word vectors. Given qualitatively different data subsets, we aim to align the influence of single subsets on the resulting word vectors while retaining their quality. To this end, we propose a criterion to measure the shift towards a single data subset and develop approaches to meet both objectives. We find that a weighted average of the two subset embeddings balances the influence of those subsets, while word similarity performance decreases. We further propose a promising optimization approach to balance influences and quality of word embeddings.

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2001.04693 [cs.CL]
	(or arXiv:2001.04693v1 [cs.CL] for this version)

Submission history

From: Stephanie Brandl
[v1] Tue, 14 Jan 2020 10:12:50 GMT (427kb,D)
