Mulan: A Multi-Level Alignment Model for Video Question Answering

Yu Fu; Cong Cao; Yuling Yang; Yuhai Lu; Fangfang Yuan; Dakui Wang; Yanbing Liu

EMNLP 2023

Abstract

Video Question Answering (VideoQA) aims to answer questions about the visual content of a video. Current methods mainly focus on improving joint representations of video and text. However, these methods pay little attention to the fine-grained semantic interaction between video and text. In this paper, we propose Mulan: a Multi-Level Alignment Model for Video Question Answering, which establishes alignment between visual and textual modalities at the object level, frame level, and video level. Specifically, for object-level alignment, we propose a mask-guided visual feature encoding method and a visual-guided text description method to learn fine-grained spatial information. For frame-level alignment, we introduce the use of visual features from individual frames, combined with a caption generator, to learn overall spatial information within the scene. For video-level alignment, we propose an expandable ordinal prompt for textual descriptions, combined with visual features, to learn temporal information. Experimental results show that our method outperforms the state-of-the-art methods, even when utilizing the smallest amount of extra visual-language pre-training data and a reduced number of trainable parameters.
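The abstract does not spell out the wording of the ordinal prompt, so the following is only an illustrative sketch of the general idea: per-frame captions are prefixed with ordinal markers so that the resulting text carries frame order. The names `ordinal_word` and `build_ordinal_prompt`, the choice of ordinal words, and the sentence template are all assumptions made for this example and are not taken from the paper.

```python
from typing import List

# Hypothetical ordinal words; the paper's actual prompt wording is not
# given in the abstract.
ORDINALS = ["First", "Second", "Third", "Fourth", "Fifth"]


def ordinal_word(i: int) -> str:
    """Return an ordinal label for 0-based index i.

    Falls back to a numeric form ("6th", "21st", ...) past the word list,
    which is one way a prompt could stay "expandable" to more frames.
    """
    if i < len(ORDINALS):
        return ORDINALS[i]
    n = i + 1
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def build_ordinal_prompt(captions: List[str]) -> str:
    """Join per-frame captions into one text with ordinal prefixes."""
    parts = [
        f"{ordinal_word(i)}, {caption.strip().rstrip('.')}."
        for i, caption in enumerate(captions)
    ]
    return " ".join(parts)


if __name__ == "__main__":
    # Captions here are made-up placeholders, not model output.
    print(build_ordinal_prompt(["a man opens a door", "he walks inside."]))
```

In a pipeline of this kind, the captions would come from a caption generator applied to sampled frames, and the resulting text would be fed to the text encoder alongside the visual features; how Mulan does this in detail is described in the paper itself, not here.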

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Processing > Video Understanding
Natural Language Processing > Applications > Question Answering
Computer Vision > Core AI > Multimodal Learning
Computer Vision > Analysis > Video Understanding
Deep Learning > Learning Types > Multi-Modal Learning
Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

temporal modeling, object detection, question answering, multimodal learning, multi-modal learning, video understanding, video question answering, spatial information, fine-grained alignment, visual-text alignment, visual-textual alignment, multi-level alignment
