What Makes Cryptic Crosswords Challenging for LLMs?

Abdelrahman Sadallah; Daria Kotova; Ekaterina Kochmar

2025 COLING COLING 2025

What Makes Cryptic Crosswords Challenging for LLMs?

Abstract

AbstractCryptic crosswords are puzzles that rely on general knowledge and the solver’s ability to manipulate language on different levels, dealing with various types of wordplay. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs). However, there is little to no research on the reasons for their poor performance on this task. In this paper, we establish the benchmark results for three popular LLMs: Gemma2, LLaMA3 and ChatGPT, showing that their performance on this task is still significantly below that of humans. We also investigate why these models struggle to achieve superior performance. We release our code and introduced datasets at https://github.com/bodasadallah/decrypting-crosswords.

❓ The Questioner

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — cryptic crossword solving

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Abdelrahman Sadallah , Daria Kotova , Ekaterina Kochmar

Topics

Artificial Intelligence > Core AI > Game AI Artificial Intelligence > Core AI > Interpretability Natural Language Processing > Applications > Question Answering Artificial Intelligence > Core AI > Large Language Models Machine Learning > Learning Types > Deep Learning

Keywords

benchmark evaluation puzzle solving natural language understanding game ai large language model cryptic crossword cryptic crossword solving

Download PDF

Related papers

Navigating Dialectal Bias and Ethical Complexities in Levantine Arabic Hate Speech Detection 2025

TaCIE: Enhancing Instruction Comprehension in Large Language Models through Task-Centred Instruction Evolution 2025

Positive Text Reframing under Multi-strategy Optimization 2025

RAM2C: A Liberal Arts Educational Chatbot based on Retrieval-augmented Multi-role Multi-expert Collaboration 2025

Two-stage Incomplete Utterance Rewriting on Editing Operation 2025