ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Ofer Meshi; Krisztian Balog; Sally Goldman; Avi Caciularu; Guy Tennenholtz; Jihwan Jeong; Amir Globerson; Craig Boutilier

EACL 2026

Abstract

The promise of *LLM-based user simulators* to improve conversational AI is hindered by a critical "realism gap": systems optimized for simulated interactions may fail to perform well in the real world. We introduce *ConvApparel*, a new dataset of human-AI conversations designed to address this gap. Its dual-agent data collection protocol, which uses both "good" and "bad" recommenders, captures a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction, and thereby enables counterfactual validation. We propose a comprehensive validation framework that combines *statistical alignment*, a *human-likeness score*, and *counterfactual validation* to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation, where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.
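As a rough illustration of what a statistical alignment check could look like, the sketch below compares the empirical distribution of one conversation-level statistic between human and simulated conversations using total variation distance. The abstract does not specify which statistics or distance measures the framework uses, so the choice of statistic (turns per conversation), the distance, the function name `total_variation`, and the sample values are all assumptions made for illustration only.

```python
from collections import Counter


def total_variation(human, simulated):
    """Total variation distance between two empirical distributions.

    Returns a value in [0, 1]: 0 means identical distributions,
    1 means the two samples share no values at all.
    """
    p, q = Counter(human), Counter(simulated)
    n_p, n_q = len(human), len(simulated)
    support = set(p) | set(q)
    # Counter returns 0 for missing keys, so unseen values contribute fully.
    return 0.5 * sum(abs(p[x] / n_p - q[x] / n_q) for x in support)


# Hypothetical turn counts per conversation, invented for this example;
# they are not taken from the ConvApparel dataset.
human_turns = [4, 5, 5, 6, 8]
simulated_turns = [5, 5, 5, 6, 6]

gap = total_variation(human_turns, simulated_turns)
```

Under the same assumptions, a counterfactual check could compute this gap separately for conversations with the "good" and the "bad" recommender, and ask whether a simulator that saw only one condition stays close to human behavior in the other.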

Topics

Artificial Intelligence > Core AI > Agent Systems

Artificial Intelligence > Core AI > Human-AI Interaction

Natural Language Processing > Generation > Dialogue Systems

Keywords

dialogue system, user simulator, conversational recommender, counterfactual validation, human-AI conversation
