QualEval: Qualitative Evaluation for Model Improvement

Vishvak Murahari; Ameet Deshpande; Peter Clark; Tanmay Rajpurohit; Ashish Sabharwal; Karthik Narasimhan; Ashwin Kalyan

2024 NAACL NAACL 2024

QualEval: Qualitative Evaluation for Model Improvement

Abstract

AbstractQuantitative evaluation metrics have been pivotal in gauging the advancements of AI systems like large language models (LLMs).However, due to the intricate nature of real-world tasks, a single scalar to quantify and compare performance trivializes the fine-grained nuances of model behavior. Additionally, metrics do not yield actionable diagnostics for model improvement, thus requiring extensive manual efforts of scientists, involving sifting through vast datasets and attempting hit-or-miss adjustments to training data or setups. In this work, we address the shortcomings of quantitative metrics by proposing QualEval, which uses automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights that when applied, accelerate model improvement. The insights are supported by a dashboard report with fine-grained visualizations and human-interpretable analyses. We corroborate the faithfulness of QualEval by demonstrating that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative on a challenging dialogue task (DialogSum) when compared to baselines. QualEval successfully increases the pace and quality of model development by eliminating the need of arduous manual analysis, thus serving as a data-scientist-in-a-box.

🐣 Hot Topic Early Bird — automated evaluation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Vishvak Murahari , Ameet Deshpande , Peter Clark , Tanmay Rajpurohit , Ashish Sabharwal , Karthik Narasimhan , Ashwin Kalyan

Topics

Artificial Intelligence > Core AI > Foundation Models Artificial Intelligence > Core AI > Interpretability

Keywords

automated evaluation dialogue system large language model qualitative evaluation model improvement

Download PDF

Related papers

Working Alliance Transformer for Psychotherapy Dialogue Classification 2024

Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences 2024

Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study 2024

TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation 2024

Extractive Summarization with Text Generator 2024