Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities

Ken Shi; Gerald Penn

2025 COLING COLING 2025

Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities

Abstract

AbstractIn this paper, we introduce the concept of Semantic Masking, where semantically coherent surrounding text (the haystack) interferes with the retrieval and comprehension of specific information (the needle) embedded within it. We propose the Needle-in-a-Haystack-QA Test, an evaluation pipeline that assesses LLMs’ long-text capabilities through question answering, explicitly accounting for the Semantic Masking effect. We conduct experiments to demonstrate that Semantic Masking significantly impacts LLM performance more than text length does. By accounting for Semantic Masking, we provide a more accurate assessment of LLMs’ true proficiency in utilizing extended contexts, paving the way for future research to develop models that are not only capable of handling longer inputs but are also adept at navigating complex semantic landscapes.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — long-text capability

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Ken Shi , Gerald Penn

Topics

Natural Language Processing > Understanding > Semantic Analysis Natural Language Processing > Resources & Methods > Large Language Models Artificial Intelligence > Core AI > Large Language Models Machine Learning > Optimization & Theory > Evaluation Machine Learning > Learning Types > Evaluation Deep Learning > Optimization & Theory > Optimization

Keywords

question answering long context context retrieval long-text capability semantic masking retrieval evaluation

Download PDF

Related papers

Navigating Dialectal Bias and Ethical Complexities in Levantine Arabic Hate Speech Detection 2025

TaCIE: Enhancing Instruction Comprehension in Large Language Models through Task-Centred Instruction Evolution 2025

Positive Text Reframing under Multi-strategy Optimization 2025

RAM2C: A Liberal Arts Educational Chatbot based on Retrieval-augmented Multi-role Multi-expert Collaboration 2025

Two-stage Incomplete Utterance Rewriting on Editing Operation 2025