ACL 2025

EgoNormia: Benchmarking Physical-Social Norm Understanding

Abstract

Human activity is moderated by norms; however, supervision for normative reasoning is sparse, particularly where norms are physically or socially grounded. We thus present EgoNormia |𝜖|, comprising 1,853 (200 for EgoNormia-verified) multiple-choice questions (MCQs) grounded in egocentric videos of human interactions, enabling the evaluation and improvement of normative reasoning in vision-language models (VLMs). EgoNormia spans seven norm categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline to generate grounded MCQs from raw egocentric video. Our work demonstrates that current state-of-the-art VLMs lack robust grounded norm understanding, scoring a maximum of 54% on EgoNormia and 58% on EgoNormia-verified, with performance across norm categories indicating significant safety and privacy risks when VLMs are used in real-world agents. We additionally explore methods for improving normative understanding, demonstrating that a naive retrieval-based generation (RAG) method using EgoNormia can enhance normative reasoning in VLMs.
