S$^2$-MLP: Spatial-Shift MLP Architecture for Vision

Yu, Tan; Li, Xu; Cai, Yunfeng; Sun, Mingming; Li, Ping

(Submitted on 14 Jun 2021 (v1), last revised 23 Jun 2021 (this version, v2))

Abstract: Recently, the Vision Transformer (ViT) and its follow-up works abandoned convolution in favor of the self-attention operation, attaining accuracy comparable to or even higher than that of CNNs. More recently, MLP-Mixer abandoned both convolution and self-attention, proposing an architecture that contains only MLP layers. To enable cross-patch communication, it adds a token-mixing MLP alongside the channel-mixing MLP. MLP-Mixer achieves promising results when trained on an extremely large-scale dataset, but it falls short of its CNN and ViT counterparts when trained on medium-scale datasets such as ImageNet-1K and ImageNet-21K. This performance drop motivates us to rethink the token-mixing MLP. We discover that the token-mixing MLP is a variant of depthwise convolution with a global receptive field and a spatial-specific configuration, and that these two properties make it prone to over-fitting. In this paper, we propose a novel pure-MLP architecture, spatial-shift MLP (S$^2$-MLP). Unlike MLP-Mixer, S$^2$-MLP contains only channel-mixing MLPs. Communication between patches is handled by a spatial-shift operation, which has a local receptive field, is spatial-agnostic, is parameter-free, and is computationally efficient. S$^2$-MLP attains higher recognition accuracy than MLP-Mixer when trained on the ImageNet-1K dataset. Meanwhile, S$^2$-MLP matches the performance of ViT on ImageNet-1K with a considerably simpler architecture and fewer FLOPs and parameters.
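The spatial-shift operation described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not state how many channel groups are shifted, in which directions, or how borders are handled, so the choices here (four equal channel groups, a shift of one patch, border patches left unchanged) are assumptions made for illustration, and the function name `spatial_shift` is hypothetical.

```python
import numpy as np


def spatial_shift(x):
    """Shift four channel groups of an (H, W, C) patch-feature map by one patch.

    Assumptions (not stated in the abstract): four equal channel groups,
    a shift of one patch per group, and border patches keep their values.
    The operation has no learnable parameters.
    """
    h, w, c = x.shape
    g = c // 4
    out = x.copy()  # read from x, write to out, so shifts do not interfere
    out[1:, :, :g] = x[:-1, :, :g]                    # group 0: shift down
    out[:-1, :, g:2 * g] = x[1:, :, g:2 * g]          # group 1: shift up
    out[:, 1:, 2 * g:3 * g] = x[:, :-1, 2 * g:3 * g]  # group 2: shift right
    out[:, :-1, 3 * g:] = x[:, 1:, 3 * g:]            # group 3: shift left
    return out


# Toy 3x3 grid of patches with 4 channels each.
x = np.arange(3 * 3 * 4, dtype=float).reshape(3, 3, 4)
shifted = spatial_shift(x)
```

After the shift, each interior patch holds channels taken from its four neighbours, so a channel-mixing MLP applied per patch can combine information from adjacent patches without any token-mixing weights.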

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2106.07477 [cs.CV]
	(or arXiv:2106.07477v2 [cs.CV] for this version)

Submission history

From: Ping Li
[v1] Mon, 14 Jun 2021 15:05:11 GMT (125kb,D)
[v2] Wed, 23 Jun 2021 17:58:04 GMT (129kb,D)
