Title: Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding

Abstract: Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training, and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.
Comments: Published in the Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Journal reference: 2020.emnlp-main.455
Cite as: arXiv:2004.14870 [cs.CL]
  (or arXiv:2004.14870v4 [cs.CL] for this version)

Submission history

From: Samson Tan
[v1] Thu, 30 Apr 2020 15:15:40 GMT (64kb,D)
[v2] Sun, 11 Oct 2020 18:54:40 GMT (7229kb,D)
[v3] Fri, 16 Oct 2020 05:20:28 GMT (7230kb,D)
[v4] Wed, 18 Nov 2020 06:16:31 GMT (7229kb,D)
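
The encoding idea summarized in the abstract (reduce each inflected word to its base form, then reinject the grammatical information as a special symbol) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the hand-written lookup table `TOY_INFLECTIONS`, the function name `toy_base_inflection_encode`, and the bracketed Penn Treebank-style tags used as inflection symbols. The abstract does not specify the lemmatizer, tagger, or symbol format, so this should not be read as the authors' implementation.

```python
# Toy illustration only: a hand-written table standing in for a real
# lemmatizer and part-of-speech tagger (an assumption, not the paper's tooling).
# Each entry maps an inflected form to (base form, inflection tag).
TOY_INFLECTIONS = {
    "went": ("go", "VBD"),
    "goes": ("go", "VBZ"),
    "dogs": ("dog", "NNS"),
    "running": ("run", "VBG"),
}


def toy_base_inflection_encode(tokens):
    """Replace each known inflected token with its base form followed by a
    bracketed symbol carrying the inflection; pass other tokens through."""
    encoded = []
    for token in tokens:
        entry = TOY_INFLECTIONS.get(token.lower())
        if entry is None:
            # Unknown or uninflected token: keep it unchanged.
            encoded.append(token)
        else:
            base, tag = entry
            encoded.extend([base, f"[{tag}]"])
    return encoded


print(toy_base_inflection_encode("She goes home".split()))
# prints ['She', 'go', '[VBZ]', 'home']
```

In this sketch a non-standard form would share its base token with the standard one and differ only in the inflection symbol, which is the intuition behind the robustness claim in the abstract.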