We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

cs.CL

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo

Computer Science > Computation and Language

Title: Compressing Word Embeddings Using Syllables

Abstract: This work examines the possibility of using syllable embeddings, instead of the often used $n$-gram embeddings, as subword embeddings. We investigate this for two languages: English and Dutch. To this end, we also translated two standard English word embedding evaluation datasets, WordSim353 and SemEval-2017, to Dutch. Furthermore, we provide the research community with data sets of syllabic decompositions for both languages. We compare our approach to full word and $n$-gram embeddings. Compared to full word embeddings, we obtain English models that are 20 to 30 times smaller while retaining 80% of the performance. For Dutch, models are 15 times smaller for 70% performance retention. Although less accurate than the $n$-gram baseline we used, our models can be trained in a matter of minutes, as opposed to hours for the $n$-gram approach. We identify a path toward upgrading performance in future work. All code is made publicly available, as well as our collected English and Dutch syllabic decompositions and Dutch evaluation set translations.
Comments: 19 pages 3 figures 11 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2201.04913 [cs.CL]
  (or arXiv:2201.04913v1 [cs.CL] for this version)

Submission history

From: Laurent Mertens [view email]
[v1] Thu, 13 Jan 2022 12:09:44 GMT (108kb,D)

Link back to: arXiv, form interface, contact.