CHARP: Conversation History AwaReness Probing for Knowledge-grounded Dialogue Systems

Abbas Ghaddar; David Alfonso-Hermelo; Philippe Langlais; Mehdi Rezagholizadeh; Boxing Chen; Prasanna Parthasarathi

ACL 2024

Abstract

In this work, we dive deep into one of the popular knowledge-grounded dialogue benchmarks that focus on faithfulness, FaithDial. We show that a significant portion of the FaithDial data contains annotation artifacts, which may bias models towards completely ignoring the conversation history. We therefore introduce CHARP, a testbed designed for evaluating supposedly non-hallucinatory models trained on the FaithDial dataset. Our extensive analysis reveals that models perform poorly on CHARP primarily because they fail to attend to and reason over the conversation history. Furthermore, the evaluation methods of FaithDial fail to capture these shortcomings because they neglect the conversation history. Our findings indicate that there is substantial room for contribution in both dataset creation and hallucination evaluation for knowledge-grounded dialogue, and that CHARP can serve as a tool for monitoring progress in this research area. Data, models, and source code will be publicly available upon acceptance.

Topics

Artificial Intelligence > Core AI > Interpretability
Natural Language Processing > Generation > Dialogue Systems
Natural Language Processing > Applications > Dialogue Systems
Artificial Intelligence > Core AI > Dialogue Systems

Keywords

benchmark evaluation, attention mechanism, language model, hallucination detection, knowledge-grounded dialogue, conversation history, hallucination evaluation
