Making Visual Dialogue More Engaging: A New Task, Method, and Metric

Guanghui Ye; Huan Zhao; Yingxue Gao; Zhixue Zhao; Kehan Wang; Xupeng Zha; Zhihua Jiang

AAAI 2026

Abstract

Large language model (LLM)-based visual dialogue (VD) systems have made response generation for image-grounded conversations more correct and coherent. However, user engagement (the extent to which a user is interested, emotionally involved, and willing to continue the conversation) remains a challenge. To fully explore engaging VD, we propose: (i) a new task named Audio-enhanced VD (AVD), which adds audio dialogue context as input to convey the speaker's emotions more vividly, with the aim of generating correct but more engaging dialogue responses. Specifically, we employ a text-to-speech model as the modality translator to generate paired acoustic utterances from the input textual utterances; (ii) an accompanying approach named Visually-grounded and Interleaved Text-Audio Dialogue Modeling (VITA-DM), which uses both image-grounded information and interleaved text-audio utterances for visual dialogue modeling, unlike previous multi-modal LLM (MLLM)-based methods, which normally model the text and audio modalities separately. We also present three pre-training tasks to better learn multi-modal interactions across language, vision, and audio; (iii) a novel metric named Multi-Modal Engagement (MME), which fills the gap in engagement estimation for VD and provides a fine-grained assessment along three dimensions: emotional engagement (EE), attentional engagement (AE), and reply engagement (RE). We experiment on two popular datasets and provide extensive evaluations (automatic, engagement-specific, and human) that support the validity of our approach. Furthermore, our empirical results show that emotions contribute the most to engagement, which justifies our emphasis on the emotional aspect throughout the definition, solution, and evaluation of our task.
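As a rough illustration of the AVD input described in the abstract, the sketch below pairs each textual utterance with a synthesized audio counterpart and interleaves the two after the image. It is not the authors' implementation: the `Segment` type, the `fake_tts` stand-in for a text-to-speech model, the `build_interleaved_context` helper, and the choice to place the image first are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    """One element of the dialogue context (hypothetical container)."""
    modality: str   # "image", "text", or "audio"
    speaker: str    # empty for the image segment
    payload: object


def fake_tts(text: str) -> List[float]:
    """Hypothetical stand-in for a text-to-speech model.

    Returns a dummy one-sample "waveform" so the sketch runs without
    any audio dependency; a real system would synthesize speech here.
    """
    return [float(len(text))]


def build_interleaved_context(
    image: object,
    turns: List[Tuple[str, str]],
    tts: Callable[[str], List[float]] = fake_tts,
) -> List[Segment]:
    """Build an image-grounded, interleaved text-audio context.

    Each (speaker, text) turn is followed directly by its paired
    audio utterance, so text and audio alternate within one sequence.
    """
    segments = [Segment("image", "", image)]
    for speaker, text in turns:
        segments.append(Segment("text", speaker, text))
        segments.append(Segment("audio", speaker, tts(text)))
    return segments


if __name__ == "__main__":
    context = build_interleaved_context(
        "img.jpg", [("Q", "What is this?"), ("A", "A dog.")]
    )
    print([segment.modality for segment in context])
```

Keeping each audio segment adjacent to its source text is one simple way to realize the interleaving; the ordering and representation actually used by VITA-DM are not specified in the abstract.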

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Domain Adaptation
Natural Language Processing > Generation > Dialogue Systems

Keywords

dialogue generation, emotion recognition, multimodal interaction, visual dialogue, engagement estimation, audio-enhanced dialogue, multimodal engagement, text-audio dialogue
