Attention Bottlenecks for Multimodal Fusion

Arsha Nagrani; Shan Yang; Anurag Arnab; Aren Jansen; Cordelia Schmid; Chen Sun

2021 NIPS NeurIPS 2021

Attention Bottlenecks for Multimodal Fusion

Abstract

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks.A common approach for building multimodal models is to simply combine multiple of these modality-specific architectures using late-stage fusion of final representations or predictions ('late-fusion').Instead, we introduce a novel transformer based architecture that uses 'attention bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, these bottlenecks force information between different modalities to pass through a small number of '`bottleneck' latent units, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning

🧭 Keyword Pioneer — audio-visual classification

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Arsha Nagrani , Shan Yang , Anurag Arnab , Aren Jansen , Cordelia Schmid , Chen Sun

Topics

Machine Learning > Application Areas > Efficient Computing Deep Learning > Architectures > Transformers Deep Learning > Techniques > Model Architecture Deep Learning > Models > Transformers Deep Learning > Learning Types > Multi-Modal Learning Deep Learning > Techniques > Attention

Keywords

transformer architecture attention mechanism multimodal learning multimodal fusion late fusion attention bottleneck audio-visual classification latent unit

Download PDF

Related papers

Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data 2021

On Model Calibration for Long-Tailed Object Detection and Instance Segmentation 2021

Test-Time Personalization with a Transformer for Human Pose Estimation 2021

NTopo: Mesh-free Topology Optimization using Implicit Neural Representations 2021

Scalable Intervention Target Estimation in Linear Models 2021