Visual Transformers: Where Do Transformers Really Belong in Vision Models?

Bichen Wu; Chenfeng Xu; Xiaoliang Dai; Alvin Wan; Peizhao Zhang; Zhicheng Yan; Masayoshi Tomizuka; Joseph E. Gonzalez; Kurt Keutzer; Peter Vajda

ICCV 2021

Abstract

A recent trend in computer vision is to replace convolutions with transformers. However, the performance gain of transformers is attained at a steep cost, requiring GPU years and hundreds of millions of samples for training. This excessive resource usage compensates for a misuse of transformers: transformers densely model relationships between their inputs -- ideal for late stages of a neural network, when concepts are sparse and spatially distant, but extremely inefficient for early stages of a network, when patterns are redundant and localized. To address these issues, we leverage the respective strengths of both operations, building convolution-transformer hybrids. Critically, in sharp contrast to pixel-space transformers, our Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context. Our VTs significantly outperform baselines: on ImageNet, our VT-ResNets outperform convolution-only ResNet by 4.6 to 7 points and transformer-only ViT-B by 2.6 points with 2.5 times fewer FLOPs and 2.1 times fewer parameters. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
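The abstract does not spell out how the semantic tokens are formed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes tokens are attention-weighted averages of pixel features, and that the transformer is a single-head self-attention with identity projections. All names (`tokenize`, `w_group`, `attention_pair_count`) and sizes are hypothetical. The last helper illustrates why dense attention is costly on pixels: the number of pairwise relationships grows quadratically with the number of inputs.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def softmax(values):
    """Numerically stable softmax over a flat list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def tokenize(pixels, w_group):
    """Pool N pixel features (N x C) into L tokens (L x C).

    Assumption: w_group (C x L) scores every pixel for every token, and the
    softmax runs over pixels, so each token is a weighted average of pixels.
    """
    scores = matmul(pixels, w_group)                   # N x L
    weights = [softmax(col) for col in zip(*scores)]   # L x N, rows sum to 1
    return matmul(weights, pixels)                     # L x C


def self_attention(tokens):
    """Single-head self-attention with identity projections (a simplification)."""
    dim = len(tokens[0])
    scores = matmul(tokens, transpose(tokens))         # L x L
    attn = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(attn, tokens)                        # L x C


def attention_pair_count(num_inputs):
    """Number of pairwise relationships dense attention models."""
    return num_inputs * num_inputs


if __name__ == "__main__":
    random.seed(0)
    height, width, channels, num_tokens = 14, 14, 8, 4  # hypothetical sizes
    pixels = [[random.gauss(0, 1) for _ in range(channels)]
              for _ in range(height * width)]
    w_group = [[random.gauss(0, 1) for _ in range(num_tokens)]
               for _ in range(channels)]
    tokens = self_attention(tokenize(pixels, w_group))
    print(len(tokens), len(tokens[0]))                  # 4 8
    print(attention_pair_count(height * width),         # 38416
          attention_pair_count(num_tokens))             # 16
```

With these made-up sizes, attending over 196 pixel positions involves 38,416 pairs, while attending over 4 tokens involves 16. This is the intuition behind applying the transformer to a small set of tokens rather than to every pixel.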

Topics

Machine Learning > Application Areas > Efficient Computing
Deep Learning > Architectures > Transformers
Deep Learning > Architectures > Neural Networks

Keywords

efficient computing; hybrid architecture; semantic token; neural network; visual transformer
