Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation

Tan, Sinan; Ge, Mengmeng; Guo, Di; Liu, Huaping; Sun, Fuchun

(Submitted on 26 Jan 2022)

Abstract: In the Vision-and-Language Navigation task, an embodied agent follows linguistic instructions and navigates to a specific goal. The task is important in many practical scenarios and has attracted extensive attention from both the computer vision and robotics communities. However, most existing work uses only RGB images and neglects the 3D semantic information of the scene. To this end, we develop a novel self-supervised training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation. Specifically, a region query task is designed as the pretext task, which predicts the presence or absence of objects of a particular class in a specific 3D region. Then, we construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs. Experiments show that the proposed approach achieves success rates of 68% and 66% on the validation unseen and test unseen splits of the R2R dataset, respectively, which are superior to those of most RGB-based methods utilizing vision-language transformers.
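
The region query pretext task can be illustrated with a small sketch of how its training target might be derived directly from a semantic voxel reconstruction, with no manual labels. This is only an illustration under assumptions: the abstract does not specify how regions are parameterized, which classes are used, or how the encoder consumes the query, so the sparse dictionary layout, the axis-aligned box region, the function name `region_query_label`, and the class ids below are all hypothetical.

```python
from typing import Dict, Tuple

Voxel = Tuple[int, int, int]


def region_query_label(
    semantic_voxels: Dict[Voxel, int],
    region_min: Voxel,
    region_max: Voxel,
    class_id: int,
) -> bool:
    """Return True if any voxel labeled `class_id` lies inside the
    axis-aligned box [region_min, region_max] (inclusive on both ends).

    Hypothetical helper: the box-shaped region and the sparse dict of
    voxel -> class id are assumptions made for this sketch.
    """
    for voxel, label in semantic_voxels.items():
        if label != class_id:
            continue
        inside = all(
            lo <= v <= hi for v, lo, hi in zip(voxel, region_min, region_max)
        )
        if inside:
            return True
    return False


# Toy scene with made-up class ids: 1 = chair, 2 = table.
scene = {(0, 0, 0): 1, (1, 0, 0): 1, (4, 2, 0): 2}

# Binary targets for two queries over the same region near the origin.
chair_near_origin = region_query_label(scene, (0, 0, 0), (2, 2, 2), class_id=1)
table_near_origin = region_query_label(scene, (0, 0, 0), (2, 2, 2), class_id=2)
```

In a setup like this, a model would be trained to predict such presence/absence targets from its encoded representation of the scene, which is what makes the task self-supervised: the labels come from the reconstruction itself.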

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Cite as:	arXiv:2201.10788 [cs.CV]
	(or arXiv:2201.10788v1 [cs.CV] for this version)

Submission history

From: Sinan Tan [view email]
[v1] Wed, 26 Jan 2022 07:43:47 GMT (15014kb,D)
