SYNC: A Synthetic Long-Context Understanding Benchmark for Controlled Comparisons of Model Capabilities

Shuyang Cao; Kaijian Zou; Lu Wang

2025 EMNLP EMNLP 2025

SYNC: A Synthetic Long-Context Understanding Benchmark for Controlled Comparisons of Model Capabilities

Abstract

AbstractRecently, researchers have turned to synthetic tasks for evaluation of large language models’ long-context capabilities, as they offer more flexibility than realistic benchmarks in scaling both input length and dataset size. However, existing synthetic tasks typically target narrow skill sets such as retrieving information from massive input, limiting their ability to comprehensively assess model capabilities. Furthermore, existing benchmarks often pair each task with a different input context, creating confounding factors that prevent fair cross-task comparison. To address these limitations, we introduce SYNC, a new evaluation suite of synthetic tasks spanning domains including graph understanding and translation. Each domain includes three tasks designed to test a wide range of capabilities—from retrieval, to multi-hop tracking, and to global context understanding that that requires chain-of-thought (CoT) reasoning. Crucially, all tasks share the same context, enabling controlled comparisons of model performance. We evaluate 14 LLMs on SYNC and observe substantial performance drops on more challenging tasks, underscoring the benchmark’s difficulty. Additional experiments highlight the necessity of CoT reasoning and demonstrate that poses a robust challenge for future models.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — controlled comparison

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Shuyang Cao , Kaijian Zou , Lu Wang

Topics

Artificial Intelligence > Core AI > Foundation Models Natural Language Processing > Resources & Methods > Large Language Models Artificial Intelligence > Core AI > Large Language Models Machine Learning > Learning Types > Evaluation Deep Learning > Learning Types > Representation Learning

Keywords

chain-of-thought reasoning model evaluation context window long-context understanding synthetic benchmark large language model controlled comparison

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025