
Title: Corrected CBOW Performs as well as Skip-gram

Abstract: Mikolov et al. (2013a) observed that continuous bag-of-words (CBOW) word embeddings tend to underperform Skip-gram (SG) embeddings, and this finding has been reported in subsequent works. We find that these observations are driven not by fundamental differences in their training objectives, but more likely by faulty negative sampling CBOW implementations in popular libraries such as the official implementation, word2vec.c, and Gensim. We show that after correcting a bug in the CBOW gradient update, one can learn CBOW word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks, while being many times faster to train.
Comments: Presented at WINR at EMNLP 2021, added discussion about FastText, more discussion about findings, additional results on C4 data, wording changes
Subjects: Computation and Language (cs.CL); Machine Learning (stat.ML)
Cite as: arXiv:2012.15332 [cs.CL]
  (or arXiv:2012.15332v2 [cs.CL] for this version)

Submission history

From: Ozan İrsoy
[v1] Wed, 30 Dec 2020 21:37:28 GMT (140kb,D)
[v2] Tue, 9 Nov 2021 16:28:00 GMT (162kb,D)
