Vinclat: Evaluating Reasoning, Cognition and Culture in One Game

Marc Pàmies; Javier Aula-Blasco; Aitor Gonzalez-Agirre; Marta Villegas

2026 EACL EACL 2026

Vinclat: Evaluating Reasoning, Cognition and Culture in One Game

Abstract

AbstractThis paper introduces Vinclat, a novel evaluation dataset for Catalan carefully designed to assess the reasoning capabilities and cultural knowledge of LLMs. It comprises 1,000 high-quality instances, meticulously crafted and reviewed by human annotators. Each instance presents a complex riddle that requires a two-step reasoning process involving inferential and abductive reasoning, along with other cognitive skills such as lexical retrieval, paraphrasing, flexibility in interpretation, pattern recognition, and associative thinking. Given four independent clues, models should infer intermediate concepts which, despite being seemingly unrelated, can be creatively connected to reach a final solution. The task targets a unique blend of capabilities, distinguishing it from existing NLP benchmarks. Our evaluation of state-of-the-art models reveals that these still fall significantly short of human-level reasoning, although scaling trends suggest that the performance gap may narrow over time. This indicates that Vinclat provides a robust and long-term challenge, resisting the rapid saturation that is commonly observed in many existing evaluation datasets.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Marc Pàmies , Javier Aula-Blasco , Aitor Gonzalez-Agirre , Marta Villegas

Topics

Artificial Intelligence > Core AI > Causal Inference Natural Language Processing > Resources & Methods > Large Language Models

Keywords

benchmark evaluation cultural knowledge abductive reasoning large language model

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026