DMT-RoleBench: A Dynamic Multi-Turn Dialogue Based Benchmark for Role-Playing Evaluation of Large Language Model and Agent

Dingbo Yuan; Yipeng Chen; Guodong Liu; Chenchen Li; Chengfu Tang; Dongxu Zhang; Zhenkui Wang; Xudong Wang; Song Liu

AAAI 2025

Abstract

Recent years have witnessed a profound evolution in the abilities of large language models, which has significantly boosted the proliferation of role-playing agents and platforms. Nonetheless, there is a conspicuous absence of systematic and comprehensive evaluations of role-playing abilities that are truly aligned with users' real-world interaction scenarios. To address this gap, we devise DMT-RoleBench, a benchmark designed to evaluate the role-playing abilities of large language models and agents based on dynamic multi-turn dialogues. Compared with existing role-playing benchmarks, DMT-RoleBench has several principal advantages: (1) It contains more diverse role types and system prompts in different formats. (2) We propose an innovative evaluation paradigm that assesses role-playing abilities by dynamically generating multi-turn dialogues constrained by specific evaluation intents and topics, which is well aligned with users' real-world interaction scenarios. (3) We define a three-tiered metric system and provide DMT-RM, a reward model aligned with human annotations, to annotate the dialogues. We also propose DMT-Score to calculate the final scores from the annotated dialogues. Our experiments and analysis of leading models equipped with role-playing abilities demonstrate the effectiveness of DMT-RoleBench.

Topics

Artificial Intelligence > Core AI > Agent Systems; Natural Language Processing > Generation > Dialogue Systems; Artificial Intelligence > Core AI > Large Language Models; Artificial Intelligence > Core AI > Dialogue Systems

Keywords

benchmark evaluation; language model evaluation; multi-turn dialogue; role-playing agent; agent evaluation; dialogue system; large language model; role-playing benchmark
