RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

Rujun Han; Yuhao Zhang; Peng Qi; Yumo Xu; Jenyuan Wang; Lan Liu; William Yang Wang; Bonan Min; Vittorio Castelli

EMNLP 2024

Abstract

Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA’s answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM’s answers are preferred to LFRQA’s answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
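As a rough illustration of the pairwise setup described in the abstract, the sketch below tallies per-query evaluator verdicts into preference rates, the kind of figure the 41.3% above represents. The label names ("model", "reference", "tie"), the tie category, and the function name `preference_rates` are assumptions made for this example; they are not taken from the paper's evaluation protocol.

```python
from collections import Counter

# Hypothetical verdict labels: one verdict per query, chosen by an evaluator
# comparing a model-generated answer against the reference (LFRQA) answer.
LABELS = ("model", "reference", "tie")


def preference_rates(judgments):
    """Return the fraction of queries on which each label was chosen."""
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments supplied")
    # Counter returns 0 for labels that never occur.
    return {label: counts[label] / total for label in LABELS}


if __name__ == "__main__":
    # Toy verdicts for illustration only; not data from the paper.
    toy = ["model", "reference", "reference", "tie"]
    print(preference_rates(toy))
```

Under these assumptions, the share of "model" verdicts is the rate at which a system's answers are preferred to the reference answers.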

Topics

Machine Learning > Application Areas > Domain Generalization
Natural Language Processing > Applications > Information Retrieval
Natural Language Processing > Applications > Question Answering
Machine Learning > Learning Types > Domain Generalization
Deep Learning > Learning Types > Retrieval-Augmented Generation

Keywords

retrieval augmented generation, domain generalization, domain adaptation, question answering, information retrieval, language model evaluation benchmark, cross-domain generalization, cross-domain evaluation, domain robustness, long-form answer
