Computer Science > Computer Vision and Pattern Recognition
Title: X-volution: On the unification of convolution and self-attention
(Submitted on 4 Jun 2021 (v1), last revised 7 Jun 2021 (this version, v2))
Abstract: Convolution and self-attention act as two fundamental building blocks in deep neural networks, where the former extracts local image features in a linear way while the latter non-locally encodes high-order contextual relationships. Though essentially complementary to each other, i.e., first-/high-order, state-of-the-art architectures, i.e., CNNs or transformers, lack a principled way to simultaneously apply both operations in a single computational module, due to their heterogeneous computing patterns and the excessive burden of global dot-product for visual tasks. In this work, we theoretically derive a global self-attention approximation scheme, which approximates self-attention via the convolution operation on transformed features. Based on the approximated scheme, we establish a multi-branch elementary module composed of both convolution and self-attention operations, capable of unifying both local and non-local feature interaction. Importantly, once trained, this multi-branch module could be conditionally converted into a single standard convolution operation via structural re-parameterization, rendering a pure convolution styled operator named X-volution, ready to be plugged into any modern network as an atomic operation. Extensive experiments demonstrate that the proposed X-volution achieves highly competitive visual understanding improvements (+1.2% top-1 accuracy on ImageNet classification, +1.7 box AP and +1.5 mask AP on COCO detection and segmentation).
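The structural re-parameterization step mentioned in the abstract relies on convolution being linear: parallel convolution branches whose outputs are summed can be folded into one kernel. The sketch below illustrates only that generic property on a single-channel toy example, using a 3x3 branch and a 1x1 branch. It is not the X-volution module itself; the helper names `conv2d_same` and `merge_branches`, the kernel sizes, and the random inputs are all assumptions made for illustration.

```python
import random

def conv2d_same(x, k):
    """Naive single-channel 2D cross-correlation with zero padding ('same' size)."""
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    r, c = i + a - ph, j + b - pw
                    # Positions outside the image count as zeros (zero padding).
                    if 0 <= r < h and 0 <= c < w:
                        acc += x[r][c] * k[a][b]
            out[i][j] = acc
    return out

def merge_branches(k3, k1):
    """Fold a 1x1 kernel into the centre tap of a 3x3 kernel (hypothetical helper)."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1[0][0]
    return merged

random.seed(0)
x = [[random.gauss(0, 1) for _ in range(8)] for _ in range(8)]   # toy 8x8 input
k3 = [[random.gauss(0, 1) for _ in range(3)] for _ in range(3)]  # 3x3 branch
k1 = [[random.gauss(0, 1)]]                                      # 1x1 branch

# Two-branch form: run both convolutions and sum their outputs.
y3, y1 = conv2d_same(x, k3), conv2d_same(x, k1)
two_branch = [[y3[i][j] + y1[i][j] for j in range(8)] for i in range(8)]

# Re-parameterized form: a single convolution with the merged kernel.
single = conv2d_same(x, merge_branches(k3, k1))

max_abs_diff = max(abs(two_branch[i][j] - single[i][j])
                   for i in range(8) for j in range(8))
```

Here `max_abs_diff` should sit at floating-point rounding level, since both forms compute the same linear map. Folding in a branch that is not a plain convolution, such as the approximated self-attention branch, depends on the paper's derivation and is not covered by this sketch.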
Submission history
From: Hang Wang
[v1] Fri, 4 Jun 2021 04:32:02 GMT (594kb,D)
[v2] Mon, 7 Jun 2021 09:03:46 GMT (594kb,D)