Towards Multilingual Audio-Visual Question Answering

Orchid Chetia Phukan; Priyabrata Mallick; Swarup Ranjan Behera; Aalekhya Satya Narayani; Arun Balaji Buduru; Rajesh Sharma

2024 INTERSPEECH INTERSPEECH 2024

Towards Multilingual Audio-Visual Question Answering

Abstract

In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly revolved around English and replicating it for addressing AVQA in other languages requires a substantial allocation of resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets for eight languages created from existing benchmark AVQA datasets. This prevents extra human annotation efforts of collecting questions and answers manually. To this end, we propose, MERA framework, by leveraging state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models namely MERA-L, MERA-C, MERA-T with varied model architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference benchmark for future works in multilingual AVQA.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Orchid Chetia Phukan , Priyabrata Mallick , Swarup Ranjan Behera , Aalekhya Satya Narayani , Arun Balaji Buduru , Rajesh Sharma

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Applications > Question Answering Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

machine translation audio-visual learning video understanding foundation model multilingual question answering

Download PDF

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024