DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

Min, Chen; Zhao, Dawei; Xiao, Liang; Zhao, Jian; Xu, Xinli; Zhu, Zheng; Jin, Lei; Li, Jianshu; Guo, Yulan; Xing, Junliang; Jing, Liping; Nie, Yiming; Dai, Bin

Full-text links:

Download:

Current browse context:

cs.CV

< prev | next >

new | recent | 2405

Change to browse by:

Computer Science > Computer Vision and Pattern Recognition

Title: DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

Authors: Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, Liping Jing, Yiming Nie, Bin Dai

(Submitted on 7 May 2024)

Abstract: Vision-centric autonomous driving has recently raised wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by introducing a world model-based autonomous driving 4D representation learning framework, dubbed \emph{DriveWorld}, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. Specifically, we propose a Memory State-Space Model for spatio-temporal modelling, which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset, DriveWorld achieves a 7.5% increase in mAP for 3D object detection, a 3.0% increase in IoU for online mapping, a 5.0% increase in AMOTA for multi-object tracking, a 0.1m decrease in minADE for motion forecasting, a 3.0% increase in IoU for occupancy prediction, and a 0.34m reduction in average L2 error for planning.

Comments:	Accepted by CVPR2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.04390 [cs.CV]
	(or arXiv:2405.04390v1 [cs.CV] for this version)

Submission history

From: Chen Min [view email]
[v1] Tue, 7 May 2024 15:14:20 GMT (4505kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:2405.04390

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Computer Vision and Pattern Recognition

Title: DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

Submission history