FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs’ Responsiveness to Human Feedback

Youquan Li; Miao Zheng; Fan Yang; Guosheng Dong; Bin CUI; Weipeng Chen; Zenan Zhou; Wentao Zhang

EMNLP 2025

Abstract

Human feedback is crucial in the interactions between humans and Large Language Models (LLMs). However, existing research primarily focuses on benchmarking LLMs in single-turn dialogues. Even in benchmarks designed for multi-turn dialogues, the user utterances are often independent, neglecting the nuanced and complex nature of human feedback within real-world usage scenarios. To fill this research gap, we introduce FB-Bench, a fine-grained, multi-task benchmark designed to evaluate LLMs’ responsiveness to human feedback under real-world usage scenarios in Chinese. Drawing from the two main interaction scenarios, FB-Bench comprises 591 meticulously curated samples, encompassing eight task types, five deficiency types of response, and nine feedback types. We extensively evaluate a broad array of popular LLMs, revealing significant variations in their performance across different interaction scenarios. Further analysis indicates that task, human feedback, and deficiencies of previous responses can also significantly impact LLMs’ responsiveness. Our findings underscore both the strengths and limitations of current models, providing valuable insights and directions for future research.

Topics

Artificial Intelligence > Core AI > Human-AI Interaction
Natural Language Processing > Resources & Methods > Large Language Models
Artificial Intelligence > Core AI > Large Language Models
Natural Language Processing > Applications > Dialogue Systems
Machine Learning > Learning Types > Evaluation

Keywords

benchmark evaluation, human feedback, multi-turn dialogue, human-AI interaction, dialogue system, multi-task benchmark, large language model, responsiveness evaluation, fine-grained benchmark
