A Unified Turkic Idiom Understanding Benchmark: Idiom Detection and Semantic Retrieval Across Five Turkic Languages

Gözde Aslantaş; Tunga Güngör

EACL 2026

Abstract

Idiomatic expressions are culturally grounded and semantically opaque, which makes them difficult for multilingual natural language processing systems to interpret. Despite the large speaker population of Turkic languages, resources covering monolingual and cross-lingual idioms and their meanings are limited. We introduce the first unified benchmark for idiom understanding across Turkish, Azerbaijani, Turkmen, Gagauz, and Uzbek. The compiled datasets include token-level idiom span annotations. We develop models for the idiom identification and semantic retrieval tasks. We evaluate seven models for idiom identification and nine embedding models for semantic retrieval under several fine-tuning schemes, using standard dense retrieval metrics. This benchmark provides a basis for studying idiomatic phenomena in Turkic languages and clarifies how idiomatic meanings are shared or altered, or how they diverge, across languages.
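The abstract does not name the dense retrieval metrics it uses. As a minimal sketch, assuming two metrics that are common in dense retrieval evaluation (Recall@k and mean reciprocal rank), the scoring of a ranked list of candidate meanings for an idiom query could look like the following. The function names and the identifiers `m1`, `m2`, `m3` are hypothetical and are not taken from the benchmark.

```python
from typing import Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant items found in the top-k results.

    Assumes ranked_ids contains no duplicates.
    """
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical example: one idiom query whose correct meaning is "m1",
# ranked second by an embedding model.
ranked = ["m3", "m1", "m2"]
relevant = {"m1"}
top1 = recall_at_k(ranked, relevant, k=1)
top2 = recall_at_k(ranked, relevant, k=2)
rr = reciprocal_rank(ranked, relevant)
```

In this illustration the correct meaning is missed at rank 1 and found at rank 2, so Recall@1 is lower than Recall@2 and the reciprocal rank is one half.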

Topics

Natural Language Processing > Resources & Methods > Multilingual NLP
Natural Language Processing > Resources & Methods > Text Representation

Keywords

multilingual NLP, cross-lingual transfer, semantic retrieval, embedding model, idiom detection
