GEAR: A Scalable and Interpretable Evaluation Framework for RAG-Based Car Assistant Systems

Niloufar Beyranvand; Hamidreza Dastmalchi; Aijun An; Heidar Davoudi; Winston Chan; Ron DiCarlantonio

EMNLP 2025

Abstract

Large language models (LLMs) increasingly power car assistants, enabling natural language interaction for tasks such as maintenance, troubleshooting, and operational guidance. While retrieval-augmented generation (RAG) improves grounding using vehicle manuals, evaluating response quality remains a key challenge. Traditional metrics like BLEU and ROUGE fail to capture critical aspects such as factual accuracy and information coverage. We propose GEAR, a fully automated, reference-based evaluation framework for car assistant systems. GEAR uses LLMs as evaluators to compare assistant responses against ground-truth counterparts, assessing coverage, correctness, and other dimensions of answer quality. To enable fine-grained evaluation, both responses are decomposed into key facts and labeled as essential, optional, or safety-critical using LLMs. The evaluator then determines which of these facts are correct and covered. Experiments show that GEAR aligns closely with human annotations, offering a scalable and reliable solution for evaluating car assistants.
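
The fact-level scoring described in the abstract can be illustrated with a minimal Python sketch. It assumes the LLM steps (decomposition into key facts, labeling, and the covered/correct verdicts) have already run and stores their outputs as plain booleans. The class names, the example facts, and the simple ratio formulas are illustrative assumptions; the abstract does not state how GEAR aggregates its verdicts.

```python
from dataclasses import dataclass
from enum import Enum


class FactLabel(Enum):
    # The three labels named in the abstract.
    ESSENTIAL = "essential"
    OPTIONAL = "optional"
    SAFETY_CRITICAL = "safety-critical"


@dataclass(frozen=True)
class ReferenceFact:
    """A key fact from the ground-truth answer (hypothetical structure)."""
    text: str
    label: FactLabel
    covered: bool  # evaluator verdict: does the assistant response state it?


@dataclass(frozen=True)
class ResponseFact:
    """A key fact from the assistant response (hypothetical structure)."""
    text: str
    correct: bool  # evaluator verdict: is it consistent with the ground truth?


def coverage(facts, label=None):
    """Fraction of reference facts covered, optionally for one label only.

    Assumed formula (a plain ratio). Returns None if no fact matches.
    """
    selected = [f for f in facts if label is None or f.label == label]
    if not selected:
        return None
    return sum(f.covered for f in selected) / len(selected)


def correctness(facts):
    """Fraction of response facts judged correct (assumed plain ratio)."""
    if not facts:
        return None
    return sum(f.correct for f in facts) / len(facts)


# Made-up example data, not taken from the paper.
reference_facts = [
    ReferenceFact("Park on level ground", FactLabel.ESSENTIAL, covered=True),
    ReferenceFact("Let the engine cool first", FactLabel.SAFETY_CRITICAL, covered=False),
    ReferenceFact("Keep a rag nearby", FactLabel.OPTIONAL, covered=True),
]
response_facts = [
    ResponseFact("Park on level ground", correct=True),
    ResponseFact("Open the cap while the engine is hot", correct=False),
]

overall_coverage = coverage(reference_facts)
safety_coverage = coverage(reference_facts, FactLabel.SAFETY_CRITICAL)
response_correctness = correctness(response_facts)
```

Reporting coverage per label is one way to keep a missed safety-critical fact visible even when overall coverage looks acceptable, which is the kind of distinction an n-gram metric such as BLEU or ROUGE does not draw.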

Topics

Natural Language Processing > Applications > Information Retrieval
Natural Language Processing > Applications > Question Answering
Natural Language Processing > Resources & Methods > Large Language Models
Artificial Intelligence > Core AI > Efficient Computing
Machine Learning > Learning Types > Retrieval-Augmented Generation
Natural Language Processing > Generation > Retrieval-Augmented Generation
Machine Learning > Learning Types > Large Language Models

Keywords

factual accuracy; question answering; evaluation framework; retrieval-augmented generation; answer quality; information coverage; fact checking; LLM evaluator; large language model; car assistant system; car assistant
