Long-context Language Models Fail in Basic Retrieval Tasks Without Sufficient Reasoning Steps

Yijiong Yu; Zhixiao Qi; Yongfeng Huang; Wei Wang; Weifeng.liu; Ran Chen; Ji Pei

EMNLP 2025

Abstract

Long-context language models (LCLMs), characterized by their extensive context windows, are becoming popular. Although they are nearly perfect at standard long-context retrieval tasks, our evaluations show that they fail in some basic cases. We further find that these failures can be largely resolved when the model is given a sufficient number of reasoning steps, guided by specific chain-of-thought (CoT) prompts. This result highlights the potential necessity of solving certain long-context tasks with long-CoT methods, whereas previous long-context benchmarks overlook the need for long reasoning and treat such tasks as direct QA tasks. Our code and datasets are available at https://github.com/yuyijiong/hard_retrieval_for_llm

Topics

Artificial Intelligence > Learning Paradigms > Transfer Learning
Natural Language Processing > Applications > Information Retrieval
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

information retrieval, zero-shot evaluation, long context, chain of thought prompting, reasoning step, retrieval evaluation
