RiddleBench: A New Generative Reasoning Benchmark for LLMs

Deepon Halder; Alan Saji; Thanmay Jayakumar; Anoop Kunchukuttan; Ratish Puduppully; Raj Dabre

EACL 2026


Abstract

While Large Language Models (LLMs) show remarkable capabilities, their complex reasoning skills require deeper investigation. We introduce **RiddleBench**, a new benchmark of 1,737 challenging puzzles designed to test reasoning beyond simple pattern matching. Our evaluation of state-of-the-art models reveals significant limitations, including hallucination cascades (uncritically accepting flawed peer reasoning) and poor self-correction due to strong self-confirmation bias. We also find that model performance is fragile, degrading when faced with reordered constraints or irrelevant information. RiddleBench serves as a resource for diagnosing these issues and guiding the development of more robust LLMs.
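The two fragility probes mentioned in the abstract (reordered constraints and irrelevant information) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Puzzle` record, the function names, and the exact-match scoring are hypothetical and are not taken from the RiddleBench data format or the authors' evaluation code.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Puzzle:
    """Hypothetical puzzle record; not the actual RiddleBench schema."""
    constraints: tuple[str, ...]
    question: str
    answer: str


def reorder_constraints(puzzle: Puzzle, seed: int = 0) -> Puzzle:
    """Return a copy with the constraints shuffled; the gold answer is unchanged."""
    rng = random.Random(seed)
    shuffled = list(puzzle.constraints)
    rng.shuffle(shuffled)
    return replace(puzzle, constraints=tuple(shuffled))


def add_distractor(puzzle: Puzzle, distractor: str, seed: int = 0) -> Puzzle:
    """Return a copy with one irrelevant statement inserted at a random position."""
    rng = random.Random(seed)
    items = list(puzzle.constraints)
    items.insert(rng.randint(0, len(items)), distractor)
    return replace(puzzle, constraints=tuple(items))


def accuracy(puzzles, solve) -> float:
    """Fraction of puzzles where solve(puzzle) matches the gold answer (case-insensitive)."""
    puzzles = list(puzzles)
    if not puzzles:
        return 0.0
    correct = sum(
        solve(p).strip().lower() == p.answer.strip().lower() for p in puzzles
    )
    return correct / len(puzzles)
```

Under these assumptions, a model wrapper `solve` would be scored once on the original puzzles and once on each perturbed copy; a drop in `accuracy` on the perturbed sets would indicate the kind of fragility the abstract describes.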



Topics

Artificial Intelligence > Core AI > Planning
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

hallucination detection, complex reasoning, reasoning benchmark, large language model

