Visually grounded few-shot word acquisition with fewer shots

Nortje, Leanne; van Niekerk, Benjamin; Kamper, Herman


(Submitted on 25 May 2023)

Abstract: We propose a visually grounded speech model that acquires new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. We propose an approach that can work on natural word-image pairs but with fewer examples, i.e., fewer shots. Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than any existing approach.
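
The retrieval task described in the abstract (pick the test image that depicts a spoken query word) can be illustrated with a small sketch. This is not the authors' model: the abstract does not specify the attention mechanism, so the code below assumes, purely for illustration, that each image is represented by a few region embeddings and that word-image similarity is the best cosine match between the word embedding and any region. The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def word_image_similarity(word_vec, region_vecs):
    """Score an image by its best-matching region.

    Assumption: taking the maximum over regions is a simple stand-in for a
    word-to-image attention mechanism; the paper's actual scoring may differ.
    """
    return max(cosine(word_vec, r) for r in region_vecs)

def pick_image(word_vec, images):
    """Return the index of the test image most similar to the spoken query."""
    scores = [word_image_similarity(word_vec, regions) for regions in images]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy, hand-made embeddings (hypothetical; real ones would come from
# trained speech and vision encoders).
query = [1.0, 0.0]
test_images = [
    [[0.0, 1.0], [0.1, 0.9]],   # no region resembles the query
    [[0.2, 0.8], [0.9, 0.1]],   # second region is close to the query
    [[-1.0, 0.0]],              # points away from the query
]
best = pick_image(query, test_images)
```

Here `best` indexes the second image, because one of its regions is nearly parallel to the query vector. Scoring against regions rather than a single whole-image vector lets a small object in a cluttered scene still match the query word.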

Comments:	Accepted at Interspeech 2023
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2305.15937 [cs.CL]
	(or arXiv:2305.15937v1 [cs.CL] for this version)

Submission history

From: Leanne Nortje
[v1] Thu, 25 May 2023 11:05:54 GMT (714kb,D)
