SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning

Long Chen; Hanwang Zhang; Jun Xiao; Liqiang Nie; Jian Shao; Wei Liu; Tat-Seng Chua

CVPR 2017

Abstract

Visual attention has been successfully applied in structural prediction tasks such as visual captioning and question answering. Existing visual attention models are generally spatial, i.e., the attention is modeled as spatial probabilities that re-weight the last conv-layer feature map of a CNN encoding an input image. However, we argue that such spatial attention does not necessarily conform to the attention mechanism (a dynamic feature extractor that combines contextual fixations over time), as CNN features are naturally spatial, channel-wise and multi-layer. In this paper, we introduce a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channel-wise Attentions in a CNN. In the task of image captioning, SCA-CNN dynamically modulates the sentence generation context in multi-layer feature maps, encoding where (i.e., attentive spatial locations at multiple layers) and what (i.e., attentive channels) the visual attention is. We evaluate the proposed SCA-CNN architecture on three benchmark image captioning datasets: Flickr8K, Flickr30K, and MSCOCO. It is consistently observed that SCA-CNN significantly outperforms state-of-the-art visual attention-based image captioning methods.
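To make the "where" and "what" distinction concrete, the following is a minimal NumPy sketch of re-weighting a single feature map with channel-wise and then spatial attention. It is an illustration under stated assumptions, not the paper's exact formulation: the additive tanh scoring, the channel-then-spatial order, the function names, the projection size `k`, and the randomly initialised weights (which would be learned in a real model) are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_attention(V, h, rng, k=16):
    """Return one weight per channel ("what"). V: (C, H, W), h: (d,)."""
    C = V.shape[0]
    v = V.reshape(C, -1).mean(axis=1)            # mean-pool each channel -> (C,)
    # Hypothetical parameters; random stand-ins for learned weights.
    w_c = 0.1 * rng.standard_normal(k)
    W_h = 0.1 * rng.standard_normal((k, h.size))
    w_out = 0.1 * rng.standard_normal(k)
    b = np.tanh(np.outer(v, w_c) + W_h @ h)      # (C, k), context broadcast per channel
    return softmax(b @ w_out)                    # (C,), sums to 1

def spatial_attention(V, h, rng, k=16):
    """Return one weight per location ("where"). V: (C, H, W), h: (d,)."""
    C, H, W = V.shape
    X = V.reshape(C, H * W)                      # one C-dim vector per location
    # Hypothetical parameters; random stand-ins for learned weights.
    W_s = 0.1 * rng.standard_normal((k, C))
    W_h = 0.1 * rng.standard_normal((k, h.size))
    w_out = 0.1 * rng.standard_normal(k)
    a = np.tanh(W_s @ X + (W_h @ h)[:, None])    # (k, H*W)
    return softmax(w_out @ a).reshape(H, W)      # (H, W), sums to 1

def sca_modulate(V, h, seed=0):
    """Apply channel-wise attention, then spatial attention, to V."""
    rng = np.random.default_rng(seed)
    beta = channel_attention(V, h, rng)
    V_c = V * beta[:, None, None]                # re-weight channels
    alpha = spatial_attention(V_c, h, rng)
    return V_c * alpha[None, :, :], alpha, beta  # re-weight locations
```

In a captioning model, `h` would stand for the sentence-generation context (for example, a decoder hidden state), and the same modulation could be repeated at several conv layers to obtain multi-layer attention.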

Authors

Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, Tat-Seng Chua

Topics

Computer Vision > Generation > Image Captioning
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Techniques > Attention

Keywords

image captioning, channel attention, visual attention, convolutional neural network, sentence generation, spatial attention, channel-wise attention
