JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts

Taein Son; Soo Won Seo; Jisong Kim; Seok Hwan Lee; Jun Won Choi

AAAI 2025

Abstract

Video Action Detection (VAD) entails localizing and categorizing action instances within videos, which inherently combine diverse information sources such as audio, visual cues, and surrounding scene context. Leveraging this multi-modal information effectively for VAD is challenging, because the model must identify action-relevant cues with precision. In this study, we introduce a novel multi-modal VAD architecture, the Joint Actor-centric Visual, Audio, Language Encoder (JoVALE). JoVALE is the first VAD method to integrate audio and visual features with scene-descriptive context sourced from large-capacity image captioning models. At the heart of JoVALE is the actor-centric aggregation of audio, visual, and scene-descriptive information, which enables adaptive integration of the features most relevant to recognizing each actor's actions. We have developed a Transformer-based architecture, the Actor-centric Multi-modal Fusion Network, specifically designed to capture the dynamic interactions among actors and their multi-modal contexts. Our evaluation on three prominent VAD benchmarks (AVA, UCF101-24, and JHMDB51-21) demonstrates that incorporating multi-modal information significantly enhances performance, setting a new state of the art on all three.
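The idea of actor-centric aggregation can be illustrated with a small sketch. The code below is not the authors' Actor-centric Multi-modal Fusion Network; it is a minimal, single-head attention example written under the assumption that each actor is represented by one query vector and that each modality (visual, audio, language) contributes a set of token vectors of the same dimension. The function name `actor_centric_aggregate`, the toy dimensions, and the random inputs are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def actor_centric_aggregate(actor_queries, modality_tokens):
    """Hypothetical sketch: each actor query attends over all modality tokens.

    actor_queries: list of actor vectors, each of length dim.
    modality_tokens: dict mapping a modality name to a list of token vectors.
    Returns (fused, mass): one fused vector per actor, and for each actor the
    share of attention weight that went to each modality.
    """
    names = list(modality_tokens)
    # Concatenate tokens from every modality into one shared context.
    context = [tok for n in names for tok in modality_tokens[n]]
    owners = [n for n in names for _ in modality_tokens[n]]
    dim = len(actor_queries[0])

    fused, mass = [], []
    for q in actor_queries:
        # Scaled dot-product scores between this actor and every context token.
        scores = [sum(a * b for a, b in zip(q, tok)) / math.sqrt(dim)
                  for tok in context]
        weights = softmax(scores)
        # Attention-weighted sum of context tokens: the actor's fused feature.
        fused.append([sum(w * tok[i] for w, tok in zip(weights, context))
                      for i in range(dim)])
        # Track how much attention each modality received for this actor.
        per_modality = {n: 0.0 for n in names}
        for w, owner in zip(weights, owners):
            per_modality[owner] += w
        mass.append(per_modality)
    return fused, mass


# Toy inputs (assumed sizes): 3 actors, 8-dimensional features.
rng = random.Random(0)
dim = 8


def rand_vecs(n):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


actors = rand_vecs(3)
tokens = {"visual": rand_vecs(6), "audio": rand_vecs(4), "language": rand_vecs(5)}
fused, mass = actor_centric_aggregate(actors, tokens)
```

Because the attention weights are computed per actor, two actors in the same clip can draw on different mixes of visual, audio, and language tokens, which is the intuition behind "adaptive integration" in the abstract. A real model would add learned projections, multiple heads, and stacked layers.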

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Computer Vision > Analysis > Action Recognition; Computer Vision > Core AI > Multimodal Learning; Machine Learning > Learning Types > Multi-Modal Learning

Keywords

scene understanding; multimodal learning; audio-visual learning; language model; visual feature; multi-modal fusion; video action detection; actor-centric aggregation
