The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

Wenqi Jia; Miao Liu; Hao Jiang; Ishwarya Ananthabhotla; James M. Rehg; Vamsi Krishna Ithapu; Ruohan Gao

CVPR 2024

Abstract

In recent years, the thriving development of research on egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focuses on learning about behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking the first attempt to infer exocentric conversational interactions from egocentric videos. We propose a unified multi-modal framework, Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors (speaking and listening) for both the camera wearer and all other social partners present in the egocentric video. Specifically, we adopt the self-attention mechanism to model the representations across time, across subjects, and across modalities. To validate our method, we conduct experiments on a challenging egocentric video dataset that includes multi-speaker and multi-conversation scenarios. Our results demonstrate the superior performance of our method compared to a series of baselines. We also present detailed ablation studies to assess the contribution of each component in our model. Project page: https://vjwq.github.io/AV-CONV/
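To make the prediction target concrete, the following is a minimal sketch of one way a conversational graph could be represented: the camera wearer and each social partner are nodes, and each directed edge records whether the source is speaking to or listening to the target. This is not the authors' implementation. The names `EGO`, `Edge`, and `build_graph`, the score dictionary format, and the 0.5 threshold are all illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import permutations

EGO = "ego"  # hypothetical label for the camera wearer


@dataclass(frozen=True)
class Edge:
    """Directed conversational edge from `source` to `target` (illustrative)."""
    source: str
    target: str
    speaking: bool   # source is speaking to target
    listening: bool  # source is listening to target


def build_graph(people, scores, threshold=0.5):
    """Threshold per-pair scores into directed edges.

    `scores` maps (source, target, behavior) -> probability in [0, 1];
    missing entries are treated as 0.0. The score format and the
    threshold are assumptions made for this sketch.
    """
    nodes = [EGO, *people]
    edges = []
    for src, dst in permutations(nodes, 2):  # every ordered pair, no self-loops
        speaking = scores.get((src, dst, "speaking"), 0.0) >= threshold
        listening = scores.get((src, dst, "listening"), 0.0) >= threshold
        if speaking or listening:
            edges.append(Edge(src, dst, speaking, listening))
    return edges


# Made-up scores for a camera wearer and two partners, A and B.
scores = {
    ("ego", "A", "speaking"): 0.9,
    ("A", "ego", "listening"): 0.8,
    ("A", "B", "speaking"): 0.2,
}
graph = build_graph(["A", "B"], scores)
for edge in graph:
    print(edge)
```

Edges that involve `ego` correspond to the egocentric part of the graph, while edges between partners such as A and B correspond to the exocentric part that the abstract describes as newly addressed.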

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Architectures > Transformers
Deep Learning > Architectures > Graph Neural Networks
Computer Vision > Domain-Specific > Egocentric Vision
Machine Learning > Learning Types > Multi-Modal Learning

Keywords

self-attention mechanism; multimodal learning; egocentric video; conversation analysis; conversational graph
