SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering

Tianyu Yang; Yiyang Nan; Lisen Dai; Zhenwen Liang; Yapeng Tian; Xiangliang Zhang

EMNLP 2024

Abstract

Audio-Visual Question Answering (AVQA) is a challenging task that involves answering questions based on both auditory and visual information in videos. A significant challenge is interpreting complex multi-modal scenes, which include both visual objects and sound sources, and connecting them to the given question. In this paper, we introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for AVQA. SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question. It streamlines the fusion of audio and visual information using spatial and temporal attention mechanisms to identify answers in multi-modal scenes. Extensive experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods. We will release our source code and pre-trained models.

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Core Methods > Representation Learning; Natural Language Processing > Applications > Question Answering; Computer Vision > Core AI > Multimodal Learning; Natural Language Processing > Applications > Visual Question Answering; Deep Learning > Learning Types > Multi-Modal Learning

Keywords

attention mechanism; multimodal learning; video understanding; semantic representation; temporal attention; spatial attention; audio-visual question answering; source-aware semantic representation
