Title: Language Modelling via Learning to Rank

Abstract: We consider language modelling (LM) as a multi-label structured prediction task by re-framing training from solely predicting a single ground-truth word to ranking a set of words which could continue a given context. To avoid annotating top-$k$ ranks, we generate them using pre-trained LMs: GPT-2, BERT, and Born-Again models. This leads to a rank-based form of knowledge distillation (KD). We also develop a method using $N$-grams to create a non-probabilistic teacher which generates the ranks without the need for a pre-trained LM.
We confirm the hypotheses that language modelling can be treated as a ranking task and that this can be done without a pre-trained LM. We show that rank-based KD generally improves perplexity (PPL), often with statistical significance, when compared to Kullback-Leibler-based (KL-based) KD. Surprisingly, given the simplicity of the method, $N$-grams act as competitive teachers and achieve performance similar to that of BERT or Born-Again model teachers. GPT-2 is always the best teacher, though: with a GPT-2 teacher and a Transformer-XL student on Wiki-02, rank-based KD reduces PPL from a cross-entropy baseline of 65.27 to 55.94, compared with 56.70 for KL-based KD.
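To make the contrast between rank-based and KL-based KD concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not state which ranking loss is used, so a standard Plackett-Luce listwise loss is swapped in here purely for illustration. The function names (`top_k_ranks`, `plackett_luce_loss`, `kl_loss`) and the toy vocabulary of four words are hypothetical.

```python
import math


def top_k_ranks(teacher_probs, k):
    """Return the indices of the k highest-probability words, best first."""
    order = sorted(range(len(teacher_probs)),
                   key=lambda i: teacher_probs[i], reverse=True)
    return order[:k]


def plackett_luce_loss(student_logits, ranked_ids):
    """Negative log-likelihood of a ranking under a Plackett-Luce model.

    Assumption: this listwise loss stands in for the (unspecified)
    rank-based objective. Only the order of ranked_ids is used, so the
    teacher does not need to supply probabilities.
    """
    remaining = list(range(len(student_logits)))
    loss = 0.0
    for word in ranked_ids:
        # Log-softmax of the chosen word over the words not yet ranked.
        m = max(student_logits[i] for i in remaining)
        log_z = m + math.log(sum(math.exp(student_logits[i] - m)
                                 for i in remaining))
        loss -= student_logits[word] - log_z
        remaining.remove(word)
    return loss


def kl_loss(teacher_probs, student_logits):
    """KL(teacher || student), the usual probability-matching KD loss."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(s - m) for s in student_logits))
    return sum(p * (math.log(p) - (s - log_z))
               for p, s in zip(teacher_probs, student_logits) if p > 0)


# Toy example: a teacher distribution over a 4-word vocabulary.
teacher_probs = [0.10, 0.60, 0.25, 0.05]
ranks = top_k_ranks(teacher_probs, k=2)

agreeing_student = [0.0, 2.0, 1.0, -1.0]     # same order as the teacher
disagreeing_student = [2.0, 0.0, -1.0, 1.0]  # different order

loss_agree = plackett_luce_loss(agreeing_student, ranks)
loss_disagree = plackett_luce_loss(disagreeing_student, ranks)
```

In this sketch the ranking loss depends only on the order of the teacher's top-$k$ words, which is why a non-probabilistic teacher, such as one built from $N$-gram statistics, could supply the targets, whereas `kl_loss` requires a full teacher distribution.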
Comments: Accepted to AAAI22. Minor writing fixes
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
ACM classes: I.2.7; I.2.6
Cite as: arXiv:2110.06961 [cs.CL]
  (or arXiv:2110.06961v2 [cs.CL] for this version)

Submission history

From: Arvid Frydenlund
[v1] Wed, 13 Oct 2021 18:03:47 GMT (227kb)
[v2] Fri, 10 Dec 2021 19:49:23 GMT (226kb)
