Is Less More? Quality, Quantity and Context in Idiom Processing with Natural Language Models

Knietaite, Agne; Allsebrook, Adam; Minkov, Anton; Tomaszewski, Adam; Slinko, Norbert; Johnson, Richard; Pickard, Thomas; Phelps, Dylan; Villavicencio, Aline

(Submitted on 14 May 2024)

Abstract: Compositionality in language models presents a problem when processing idiomatic expressions, as their meaning often cannot be directly derived from their individual parts. Although fine-tuning and other optimization strategies can be used to improve representations of idiomatic expressions, this depends on the availability of relevant data. We present the Noun Compound Synonym Substitution in Books - NCSSB - datasets, which are created by substitution of synonyms of potentially idiomatic English noun compounds in public domain book texts. We explore the trade-off between data quantity and quality when training models for idiomaticity detection, in conjunction with contextual information obtained locally (from the surrounding sentences) or externally (through language resources). Performance on an idiomaticity detection task indicates that dataset quality is a stronger factor for context-enriched models, but that quantity also plays a role in models without context inclusion strategies.

Comments:	14 pages, 10 figures. Presented at the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD 2024)
Subjects:	Computation and Language (cs.CL)
ACM classes:	I.2.7
Cite as:	arXiv:2405.08497 [cs.CL]
	(or arXiv:2405.08497v1 [cs.CL] for this version)

Submission history

From: Thomas Pickard [view email]
[v1] Tue, 14 May 2024 10:54:20 GMT (8876kb,D)
