CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario

Duan, Zhizhao; Cheng, Hao; Xu, Duo; Wu, Xi; Zhang, Xiangxie; Ye, Xi; Xie, Zhen

(Submitted on 6 May 2024)

Abstract: In the vast and dynamic landscape of urban settings, Traffic Safety Description and Analysis plays a pivotal role in applications ranging from insurance inspection to accident prevention. This paper introduces CityLLaVA, a novel fine-tuning framework for Visual Language Models (VLMs) designed for urban scenarios. CityLLaVA enhances model comprehension and prediction accuracy through (1) employing bounding boxes for optimal visual data preprocessing, including video best-view selection and visual prompt engineering during both training and testing phases; (2) constructing concise Question-Answer sequences and designing textual prompts to refine instruction comprehension; (3) implementing block expansion to fine-tune large VLMs efficiently; and (4) advancing prediction accuracy via a unique sequential questioning-based prediction augmentation. Demonstrating top-tier performance, our method achieved a benchmark score of 33.4308, securing the leading position on the leaderboard. The code can be found: this https URL
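As an illustration of step (1), the following is a minimal sketch of how bounding boxes could drive best-view selection across several camera views. The abstract does not state the selection criterion, so the rule used here (pick the view with the largest mean box area) is an assumption, as are the `(x, y, width, height)` box format and the names `Box`, `box_area`, and `select_best_view`.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x, y, width, height) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of one box; negative widths or heights count as zero."""
    _, _, w, h = box
    return max(w, 0.0) * max(h, 0.0)


def select_best_view(views: Dict[str, List[Box]]) -> str:
    """Return the name of the view whose boxes have the largest mean area.

    Hypothetical criterion: a larger box suggests the tracked subject is
    closer to the camera and easier to describe. Views with no boxes score 0.
    """
    if not views:
        raise ValueError("no views given")

    def mean_area(boxes: List[Box]) -> float:
        return sum(box_area(b) for b in boxes) / len(boxes) if boxes else 0.0

    return max(views, key=lambda name: mean_area(views[name]))


# Usage with made-up per-frame boxes for three hypothetical camera views.
example_views = {
    "cam_a": [(0, 0, 10, 10), (0, 0, 20, 10)],
    "cam_b": [(0, 0, 30, 20)],
    "cam_c": [],
}
best = select_best_view(example_views)
```

Mean area is only one possible score; a real pipeline might also weigh how many frames contain the subject or how far the box sits from the image border.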

Comments:	Accepted by AICITY2024 Workshop Track2 at CVPR2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.03194 [cs.CV]
	(or arXiv:2405.03194v1 [cs.CV] for this version)

Submission history

From: Hao Cheng
[v1] Mon, 6 May 2024 06:38:49 GMT (3897kb,D)
