Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer

Min Peng; Chongyang Wang; Yu Shi; Xiang-Dong Zhou

AAAI 2023

Abstract

This paper presents a new method for end-to-end Video Question Answering (VideoQA), as an alternative to the currently popular approach of large-scale pre-training with huge feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which consists simply of a learnable word embedding layer and a few convolutional and transformer layers. We use an anisotropic pyramid to carry out video-language interactions across different spatio-temporal scales. In addition to the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, we propose novel strategies that decompose the visual feature stream into spatial and temporal sub-streams at different scales and implement their interactions with the linguistic semantics while preserving the integrity of local and global semantics. We demonstrate better or on-par performance, with high computational efficiency, against state-of-the-art methods on five VideoQA benchmarks. Our ablation study shows the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, as well as the effectiveness of the pyramid. Code is available at: https://github.com/Trunpm/PMT-AAAI23.
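The canonical pyramid mentioned in the abstract (a bottom-up pathway, a top-down pathway, and lateral connections, with video-language interaction at each scale) can be illustrated with a minimal NumPy sketch. This is not the authors' PMT implementation: it has no learned weights and no spatial/temporal decomposition, and the function names (`cross_attention`, `downsample`, `upsample`, `pyramid`), the average-pooling and nearest-neighbour resampling, and the residual attention are all assumptions chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video, text):
    # Hypothetical weight-free attention: video tokens (T, d) attend to
    # question tokens (L, d); the attended text is added as a residual.
    scores = video @ text.T / np.sqrt(video.shape[-1])
    return video + softmax(scores, axis=-1) @ text

def downsample(x):
    # Halve the temporal length by averaging adjacent pairs of tokens.
    t, d = x.shape
    return x.reshape(t // 2, 2, d).mean(axis=1)

def upsample(x):
    # Double the temporal length by nearest-neighbour repetition.
    return np.repeat(x, 2, axis=0)

def pyramid(video, text, levels=3):
    # Assumes the number of video tokens is divisible by 2 ** (levels - 1).
    # Bottom-up pathway: interact with the question, then coarsen.
    bottom_up = []
    x = video
    for level in range(levels):
        x = cross_attention(x, text)
        bottom_up.append(x)
        if level < levels - 1:
            x = downsample(x)
    # Top-down pathway: start at the coarsest scale and merge each finer
    # bottom-up feature through a lateral (additive) connection.
    top_down = [bottom_up[-1]]
    for lateral in reversed(bottom_up[:-1]):
        merged = lateral + upsample(top_down[-1])
        top_down.append(cross_attention(merged, text))
    return top_down  # ordered from coarsest to finest scale

rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(8, 16))     # 8 time steps, 16-dim features
question_tokens = rng.normal(size=(5, 16))  # 5 word embeddings
outputs = pyramid(video_tokens, question_tokens)
print([o.shape for o in outputs])
```

With 8 input time steps and 3 levels, the sketch yields features at temporal lengths 2, 4, and 8, each of which has interacted with the question tokens.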

Authors

Min Peng, Chongyang Wang, Yu Shi, Xiang-Dong Zhou

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Representation Learning
Deep Learning > Architectures > Transformers
Computer Vision > Processing > Video Understanding
Natural Language Processing > Applications > Question Answering
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Learning Types > Multi-Modal Learning
Computer Vision > Generation > Visual Question Answering

Keywords

spatio-temporal modeling, efficient computing, vision-language model, spatial-temporal modeling, video question answering, multimodal transformer, text-to-video retrieval, pyramidal multimodal transformer, video-language interaction, spatio-temporal scale, pyramidal architecture
