Long Context Benchmark for the Russian Language

Igor Churin; Murat Apishev; Maria Tikhonova; Denis Shevelev; Aydar Bulatov; Yuri Kuratov; Sergei Averkiev; Alena Fenogenova

EMNLP 2025


Abstract

Recent progress in Natural Language Processing (NLP) has driven the creation of Large Language Models (LLMs) capable of tackling a vast range of tasks. A critical property of these models is their ability to handle large documents and process long token sequences, which creates the need for a robust evaluation methodology for long-text scenarios. To meet this need for the Russian language, we present a benchmark of 18 datasets designed to assess LLM performance on tasks such as information retrieval, knowledge extraction, machine reading, question answering, and reasoning. The datasets are grouped into four levels of complexity and support model evaluation at context lengths of up to 128k tokens. To facilitate further research, we release the datasets, a codebase, and a public leaderboard for the benchmark as open source.

Topics

Natural Language Processing > Applications > Information Retrieval
Natural Language Processing > Applications > Machine Reading Comprehension
Natural Language Processing > Applications > Question Answering

Keywords

question answering, information retrieval, language model evaluation, machine reading comprehension, long context, Russian language
