Structured Two-Stream Attention Network for Video Question Answering

Lianli Gao; Pengpeng Zeng; Jingkuan Song; Yuan-Fang Li; Wu Liu; Tao Mei; Heng Tao Shen

AAAI 2019

Abstract

To date, visual question answering (VQA), which covers both image QA and video QA, is still a holy grail in vision and language understanding, and video QA is especially hard. Image QA focuses primarily on understanding the associations between image region-level details and the corresponding questions, whereas video QA requires a model to jointly reason across the spatial and long-range temporal structures of a video, as well as the text, to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates different segments of the query- and video-aware context representations and infers the answers. Experiments on the large-scale video QA dataset TGIF-QA show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, 11.0% and 0.3 on the Action, Trans., FrameQA and Count tasks, respectively. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans., and FrameQA tasks by 4.1%, 4.7%, and 5.1%, respectively.
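
The three stages named in the abstract (segmenting the video, attending in two directions, fusing the results) can be illustrated with a toy sketch. This is not the authors' STA implementation: the abstract gives no equations, so the mean pooling, dot-product attention, and concatenation used here are assumptions chosen for simplicity, and every function name is hypothetical. It also assumes the video has at least as many frames as segments.

```python
import math

def split_into_segments(frames, num_segments):
    """Split per-frame feature vectors into contiguous, non-empty segments.
    Assumes len(frames) >= num_segments (an assumption of this sketch)."""
    n = len(frames)
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    return [frames[bounds[i]:bounds[i + 1]] for i in range(num_segments)]

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Dot-product attention: weight each key by its similarity to the query
    and return (weights, weighted sum of keys)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context

# Toy inputs (made up for illustration): 4 frames and 2 question words, 2-d features.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 2.0]]

# Stage 1: one pooled feature per video segment.
segments = [mean_pool(s) for s in split_into_segments(frames, 2)]

# Stage 2: two attention streams, question over segments and video over words.
question = mean_pool(words)
video_weights, video_context = attend(question, segments)
video_summary = mean_pool(segments)
text_weights, text_context = attend(video_summary, words)

# Stage 3: fuse both context vectors by concatenation (an assumed fusion rule).
fused = video_context + text_context
```

In this toy input the second segment and the second word receive the larger attention weights, because they align best with the opposite modality's pooled feature. A real model would replace the pooling and dot products with learned encoders and attention parameters.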

Topics

Computer Vision > Processing > Video Understanding; Natural Language Processing > Applications > Question Answering; Deep Learning > Techniques > Attention; Artificial Intelligence > Core AI > Multi-Modal Learning; Computer Vision > Applications > Question Answering

Keywords

visual question answering; attention mechanism; temporal reasoning; multimodal learning; video understanding; video question answering
