2025 NAACL NAACL 2025

EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild

Abstract

AbstractPredicting when to initiate speech in real-world environments remains a fundamental challenge for conversational agents. We introduce , a novel framework for real-time speech initiation prediction in egocentric streaming video. By modeling the conversation from the speaker’s first-person viewpoint, is tailored for human-like interactions in which a conversational agent must continuously observe its environment and dynamically decide when to talk.Our approach bridges the gap between simplified experimental setups and complex natural conversations by integrating four key capabilities: (1) first-person perspective, (2) RGB processing, (3) online processing, and (4) untrimmed video processing. We also present YT-Conversation, a diverse collection of in-the-wild conversational videos from YouTube, as a resource for large-scale pretraining. Experiments on EasyCom and Ego4D demonstrate that outperforms random and silence-based baselines in real time. Our results also highlight the importance of multimodal input and context length in effectively deciding when to speak. Code and data are available at website.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Natural Language Processing
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio