CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

Quan Tu; Shilong Fan; Zihang Tian; Tianhao Shen; Shuo Shang; Xin Gao; Rui Yan

2024 ACL ACL 2024

CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

Abstract

AbstractRecently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 11,376 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. To facilitate the convenient evaluation for these subjective metrics in CharacterEval, we further developed CharacterRM, a role-playing reward model based on human annotations, which has a higher correlation with human judgment compared to GPT-4. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Natural Language Processing

🐣 Hot Topic Early Bird — conversational agent

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Quan Tu , Shilong Fan , Zihang Tian , Tianhao Shen , Shuo Shang , Xin Gao , Rui Yan

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Generation > Dialogue Systems Natural Language Processing > Resources & Methods > Large Language Models Natural Language Processing > Applications > Dialogue Systems Machine Learning > Core Methods > Evaluation

Keywords

reward model evaluation benchmark conversational agent role-playing agent dialogue evaluation chinese language chinese benchmark

Download PDF

Related papers

Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs 2024

EtymoLink: A Structured English Etymology Dataset 2024

Turkish Delights: A Dataset on Turkish Euphemisms 2024

Subjectivity Detection in English News using Large Language Models 2024

Does DetectGPT Fully Utilize Perturbation? Bridging Selective Perturbation to Fine-tuned Contrastive Learning Detector would be Better 2024