Title: PatchFormer: A Versatile 3D Transformer Based on Patch Attention

Abstract: The 3D vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major 3D learning benchmarks. However, existing 3D Transformers need to generate a large attention map, which has quadratic complexity (in both space and time) with respect to input size. To address this shortcoming, we introduce patch-attention, which adaptively learns a much smaller set of bases upon which the attention maps are computed. Through a weighted summation over these bases, patch-attention not only captures the global shape context but also achieves linear complexity in the input size. In addition, we propose a lightweight Multi-scale Attention (MSA) block to build attention among features of different scales, providing the model with multi-scale features. Based on these proposed modules, we construct our neural architecture, called PatchFormer. Extensive experiments demonstrate that our network achieves strong accuracy on general 3D recognition tasks with a 7.3x speed-up over previous 3D Transformers.
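
The complexity claim in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and the bases here are simply averages of contiguous chunks of points, an assumption standing in for the adaptively learned bases the abstract describes. The point is only that attending from N points to M bases (M much smaller than N) produces an N x M attention map, so the cost grows linearly in N for a fixed M.

```python
import math


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def chunk_mean_bases(features, num_bases):
    # ASSUMPTION: stand-in for learned bases. Averages contiguous chunks,
    # returning at most num_bases basis vectors.
    n = len(features)
    size = math.ceil(n / num_bases)
    dim = len(features[0])
    bases = []
    for start in range(0, n, size):
        chunk = features[start:start + size]
        bases.append([sum(p[j] for p in chunk) / len(chunk) for j in range(dim)])
    return bases


def basis_attention(features, bases):
    # Each point attends to the M bases only, so the attention map is N x M
    # rather than N x N. The output is a weighted sum of the bases.
    d = len(features[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in features:
        scores = [scale * sum(qi * bi for qi, bi in zip(q, b)) for b in bases]
        weights = softmax(scores)
        out.append([sum(w * b[j] for w, b in zip(weights, bases)) for j in range(d)])
    return out


# Toy input: 12 points with 3 features each, summarized by 3 bases.
points = [[float(i), float(i % 3), 1.0] for i in range(12)]
bases = chunk_mean_bases(points, num_bases=3)
attended = basis_attention(points, bases)
```

Because every output row is a convex combination of the bases, each point receives context drawn from the whole shape while the per-point cost depends only on M.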
Comments: 10 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2111.00207 [cs.CV]
  (or arXiv:2111.00207v1 [cs.CV] for this version)

Submission history

From: Cheng Zhang
[v1] Sat, 30 Oct 2021 08:39:55 GMT (4075kb,D)
[v2] Thu, 2 Dec 2021 06:54:02 GMT (4074kb,D)
[v3] Thu, 24 Mar 2022 09:15:14 GMT (4181kb,D)