MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

Kaustubh Deshpande; Ved Sirdeshmukh; Johannes Baptist Mols; Lifeng Jin; Ed-Yeremai Hernandez-Cardona; Dean Lee; Jeremy Kritz; Willow E. Primack; Summer Yue; Chen Xing

ACL 2025


Abstract

We present MultiChallenge, a pioneering benchmark that evaluates large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic in current human-LLM interactions, but also challenging for all current frontier LLMs. All four challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop an LLM-as-judge with instance-level rubrics, which provides an automatic evaluation method that has fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models score below 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (October 2024) achieving just 41.4% average accuracy.
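To make the evaluation setup concrete, here is a minimal sketch of rubric-based judging and score aggregation. It is not the authors' implementation. It assumes that each instance-level rubric is a yes/no question with a known passing answer, that the judge is a callable wrapping an LLM, and that "average accuracy" means the unweighted mean of per-category accuracies; the abstract does not spell out any of these details. The names `Instance`, `score`, `toy_judge`, and the category labels are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Instance:
    category: str  # challenge category label (placeholder names in this sketch)
    response: str  # the model's final reply that gets graded
    rubric: str    # instance-specific yes/no question posed to the judge
    passing: str   # judge verdict that counts as a pass: "yes" or "no"


# A judge maps (response, rubric) to "yes" or "no". In practice this would
# wrap an LLM call; here it is left as a plain callable (assumption).
Judge = Callable[[str, str], str]


def score(instances: List[Instance], judge: Judge) -> Tuple[Dict[str, float], float]:
    """Return per-category accuracy and the unweighted mean across categories."""
    verdicts: Dict[str, List[bool]] = defaultdict(list)
    for inst in instances:
        verdict = judge(inst.response, inst.rubric).strip().lower()
        verdicts[inst.category].append(verdict == inst.passing)
    per_category = {cat: sum(v) / len(v) for cat, v in verdicts.items()}
    average = sum(per_category.values()) / len(per_category)
    return per_category, average


def toy_judge(response: str, rubric: str) -> str:
    # Stand-in for an LLM judge: ignores the rubric and answers "yes"
    # whenever the reply contains the word "peanut".
    return "yes" if "peanut" in response.lower() else "no"


examples = [
    Instance("category_a", "Try a peanut stir-fry.", "Does the reply suggest peanuts?", "no"),
    Instance("category_a", "Try a tofu stir-fry.", "Does the reply suggest peanuts?", "no"),
    Instance("category_b", "Sure, peanut butter it is.", "Does the reply mention peanut butter?", "yes"),
]
per_category, average = score(examples, toy_judge)
print(per_category, average)
```

With these toy inputs, `category_a` scores 0.5 and `category_b` scores 1.0, so the category-averaged accuracy is 0.75, whereas pooling all three instances would give 2/3. Which of the two aggregations a reported number uses matters when categories differ in size.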



Topics

Artificial Intelligence > Core AI > Human-AI Interaction
Natural Language Processing > Generation > Dialogue Systems
Natural Language Processing > Applications > Machine Reading Comprehension

Keywords

evaluation benchmark, multi-turn conversation, context reasoning, large language model


