LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Shang, Yuzhang; Cai, Mu; Xu, Bingxin; Lee, Yong Jae; Yan, Yan

(Submitted on 22 Mar 2024 (v1), last revised 12 Apr 2024 (this version, v4))

Abstract: Large Multimodal Models (LMMs) have shown significant reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically use a fixed amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which increase the number of visual tokens significantly. However, due to the design of the Transformer architecture, computational costs associated with these models tend to increase quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism and find, similar to prior work, that many visual tokens are spatially redundant. Based on this, we propose PruMerge, a novel adaptive visual token reduction approach, which largely reduces the number of visual tokens while maintaining comparable model performance. We first select the unpruned visual tokens based on their similarity to class tokens and spatial tokens. We then cluster the pruned tokens based on key similarity and merge the clustered tokens with the unpruned tokens to supplement their information. Empirically, when applied to LLaVA-1.5, our approach can compress the visual tokens by 18 times on average, and achieve comparable performance across diverse visual question-answering and reasoning tasks. Code and checkpoints are at this https URL
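The prune-then-merge idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It makes several assumptions that the abstract does not state: the function name `prune_and_merge`, a fixed `keep` budget in place of the adaptive selection rule, a top-k nearest-neighbour grouping by key dot product in place of the clustering step, and an attention-weighted average as the merge.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prune_and_merge(cls_query, keys, tokens, keep, neighbours):
    """Reduce len(tokens) visual tokens to `keep` merged tokens.

    cls_query  : query vector of the class token (hypothetical input)
    keys       : one key vector per spatial token
    tokens     : one feature vector per spatial token
    keep       : number of tokens left unpruned (assumed fixed here;
                 the paper describes an adaptive choice)
    neighbours : pruned tokens merged into each kept token (assumption)
    """
    d = len(cls_query)
    # 1. Score each spatial token by the class token's attention to it.
    attn = softmax([dot(cls_query, k) / math.sqrt(d) for k in keys])
    # 2. Keep the highest-scoring tokens; the rest are pruned.
    order = sorted(range(len(tokens)), key=lambda i: attn[i], reverse=True)
    kept = sorted(order[:keep])
    pruned = order[keep:]
    merged = []
    for i in kept:
        # 3. Find the pruned tokens whose keys are most similar to this one.
        near = sorted(pruned, key=lambda j: dot(keys[i], keys[j]),
                      reverse=True)[:neighbours]
        group = [i] + near
        weight = sum(attn[j] for j in group)
        # 4. Merge the group into one token by attention-weighted averaging.
        merged.append([
            sum(attn[j] * tokens[j][c] for j in group) / weight
            for c in range(len(tokens[i]))
        ])
    return kept, merged

# Toy input: 4 tokens of dimension 2, reduced to a single merged token.
cls_query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 1.0], [1.9, 0.1], [0.0, -1.0]]
tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
kept, merged = prune_and_merge(cls_query, keys, tokens, keep=1, neighbours=1)
print(kept, merged)
```

In this toy input, token 0 receives the most class-token attention and is kept; token 2 has the most similar key, so its features are folded into token 0 instead of being discarded outright.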

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2403.15388 [cs.CV]
	(or arXiv:2403.15388v4 [cs.CV] for this version)

Submission history

From: Yuzhang Shang
[v1] Fri, 22 Mar 2024 17:59:52 GMT (518kb,D)
[v2] Mon, 25 Mar 2024 17:59:55 GMT (518kb,D)
[v3] Mon, 1 Apr 2024 14:08:06 GMT (537kb,D)
[v4] Fri, 12 Apr 2024 17:34:29 GMT (537kb,D)
