Global Vision Transformer Pruning with Hessian-Aware Saliency

Yang, Huanrui; Yin, Hongxu; Shen, Maying; Molchanov, Pavlo; Li, Hai; Kautz, Jan

(Submitted on 10 Oct 2021 (v1), last revised 29 Mar 2023 (this version, v2))

Abstract: Transformers yield state-of-the-art results across many tasks. However, their heuristically designed architectures impose huge computational costs during inference. This work challenges the common design philosophy of the Vision Transformer (ViT), which uses a uniform dimension across all the stacked blocks in a model stage. We redistribute the parameters both across transformer blocks and between different structures within a block, in the first systematic attempt at global structural pruning. To deal with the diverse structural components of ViTs, we derive a novel Hessian-based structural pruning criterion that is comparable across all layers and structures, with latency-aware regularization for direct latency reduction. Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently. On ImageNet-1K, NViT-Base achieves a 2.6x FLOPs reduction, a 5.1x parameter reduction, and a 1.9x run-time speedup over the DeiT-Base model in a near-lossless manner. Smaller NViT variants achieve more than 1% accuracy gain at the same throughput as the DeiT Small/Tiny variants, as well as a lossless 3.3x parameter reduction over the SWIN-Small model. These results outperform prior art by a large margin. Further analysis is provided on the parameter redistribution insights of NViT, where we show the high prunability of ViT models, the distinct sensitivity of structures within a ViT block, and a unique parameter distribution trend across stacked ViT blocks. Our insights demonstrate the viability of a simple yet effective parameter redistribution rule towards more efficient ViTs for an off-the-shelf performance boost.
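
To make the idea of global structural pruning more concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract does not state the Hessian-based criterion, so the saliency scores below are placeholder numbers, and subtracting a latency term weighted by `eta` is only an assumption about how a latency-aware regularizer might enter the ranking. The function name `global_prune` and the group names `heads` and `mlp` are hypothetical.

```python
def global_prune(saliency, latency_gain, keep_ratio, eta=0.0):
    """Rank structural groups across ALL layers at once and keep the top fraction.

    saliency:     dict mapping a structure name to a list of per-group scores
                  (placeholders here; the paper derives them from Hessian information).
    latency_gain: dict mapping a structure name to the latency saved by removing
                  one group of that structure (assumed, illustrative values).
    keep_ratio:   fraction of all groups to keep.
    eta:          weight of the assumed latency penalty.
    """
    # Assumed latency-aware score: groups that are expensive to keep look less important.
    scores = {
        name: [s - eta * latency_gain[name] for s in values]
        for name, values in saliency.items()
    }
    # One ranking over every group, regardless of which structure it belongs to.
    flat = sorted((s for values in scores.values() for s in values), reverse=True)
    n_keep = max(1, int(keep_ratio * len(flat)))
    threshold = flat[n_keep - 1]
    # Ties at the threshold are all kept, so slightly more than n_keep groups may survive.
    return {name: [s >= threshold for s in values] for name, values in scores.items()}


masks = global_prune(
    saliency={"heads": [0.9, 0.1, 0.2, 0.3], "mlp": [0.8, 0.6, 0.7, 0.02]},
    latency_gain={"heads": 0.0, "mlp": 0.0},
    keep_ratio=0.5,
)
for name, mask in masks.items():
    print(name, sum(mask), "of", len(mask), "groups kept")
```

With these placeholder scores, a single global threshold keeps one of four `heads` groups and three of four `mlp` groups. This kind of non-uniform allocation is what distinguishes a global ranking from pruning every layer by the same ratio.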

Comments:	Accepted as a conference paper at CVPR 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2110.04869 [cs.CV]
	(or arXiv:2110.04869v2 [cs.CV] for this version)

Submission history

From: Huanrui Yang
[v1] Sun, 10 Oct 2021 18:04:59 GMT (1718kb,D)
[v2] Wed, 29 Mar 2023 21:00:43 GMT (4915kb,D)
