Title: Audio-Visual Grounding Referring Expression for Robotic Manipulation

Abstract: Referring expressions are commonly used in everyday dialogue to refer to a specific target. In this paper, we introduce a novel task: audio-visual grounding of referring expressions for robotic manipulation. The robot uses both audio and visual information to understand the referring expression in a given manipulation instruction and then carries out the corresponding manipulation. To solve this task, we propose an audio-visual framework for visual localization and sound recognition. We also establish a dataset containing visual data, auditory data, and manipulation instructions for evaluation. Finally, extensive offline and online experiments verify the effectiveness of the proposed framework and demonstrate that the robot performs better with audio-visual data than with visual data alone.
Subjects: Robotics (cs.RO)
Cite as: arXiv:2109.10571 [cs.RO]
  (or arXiv:2109.10571v1 [cs.RO] for this version)

Submission history

From: Yefei Wang
[v1] Wed, 22 Sep 2021 08:06:42 GMT (20605kb,D)
