Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

Sangmin Lee; Bolin Lai; Fiona Ryan; Bikram Boote; James M. Rehg

2024 CVPR CVPR 2024

Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

Abstract

Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently they are limited in modeling the intricate dynamics of multi-party interactions. In this paper we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification pronoun coreference resolution and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Interdisciplinary and Natural Language Processing

🧭 Keyword Pioneer — multimodal social interaction

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Sangmin Lee , Bolin Lai , Fiona Ryan , Bikram Boote , James M. Rehg

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Understanding > Coreference Resolution Interdisciplinary > Social > Affective Computing Computer Vision > Core AI > Multimodal Learning Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

natural language processing multimodal learning visual reasoning coreference resolution social interaction social reasoning multimodal social interaction pronoun coreference language-visual alignment language-visual representation

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024