neDIOM: Dataset and Analysis of Nepali Idioms

Rhitabrat Pokharel; Ameeta Agrawal

COLING 2025

Abstract

Idioms, integral to any language, convey nuanced meanings and cultural references. However, beyond English, few resources exist to support any meaningful exploration of this unique linguistic phenomenon. To facilitate such an inquiry in a low-resource language, we introduce a novel dataset of Nepali idioms and the sentences in which they naturally appear. We describe the methodology used to create this resource and discuss some of the challenges we encountered. The results of our empirical analysis under various settings using four distinct multilingual models consistently highlight the difficulties these models face in processing Nepali figurative language. Even fine-tuning the models yields limited benefits. Interestingly, the larger models in the BLOOM family failed to consistently outperform the smaller ones. Overall, we hope that this new resource will facilitate further development of models that can support the processing of idiomatic expressions in low-resource languages such as Nepali.

Topics

Deep Learning > Models > Generative Models
Interdisciplinary > Linguistics > Computational Linguistics
Machine Learning > Learning Types > Representation Learning
Natural Language Processing > Resources & Methods > Language Modeling
Natural Language Processing > Applications > Text Generation
Machine Learning > Learning Types > Deep Learning
Machine Learning > Learning Types > Evaluation
Artificial Intelligence > Core AI > Language

Keywords

text generation, language model, evaluation, low-resource language, multilingual model, figurative language, language resource, idiom processing
