EACL 2026

Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language

Abstract

Can large language models converse in languages virtually absent from their training data? We investigate this question through a case study on Tulu, a Dravidian language with over two million speakers but minimal digital presence. Rather than fine-tuning, we examine whether structured prompt engineering alone can elicit basic conversational ability under extreme data scarcity. Our framework combines explicit grammar documentation, negative constraints to suppress high-probability tokens from related languages, romanization standardization, and quality-controlled synthetic data generation via self-play. Evaluated on a manually curated held-out set across three LLMs (Gemini 2.0 Flash, GPT-4o, and Llama 3.1 70B) and validated by native speakers, our approach reduces vocabulary contamination from 80% to 5% while achieving 85% grammatical accuracy. Cross-model analysis shows that negative constraints provide consistent improvements (12–18 percentage points), while the effectiveness of grammar documentation varies by model architecture (8–22 points). These results demonstrate that structured in-context learning can meaningfully extend LLM capabilities to extremely low-resource languages without parameter updates.
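
To make the framework's components concrete, the following is a minimal sketch of how a structured prompt with grammar notes, negative constraints, and romanization rules might be assembled, together with a simple token-level contamination measure. It is an illustration under stated assumptions, not the authors' implementation: the names `PromptSpec`, `build_prompt`, and `contamination_rate` are hypothetical, the section wording is invented for the example, and the abstract does not specify how vocabulary contamination is actually computed.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the prompt components named in the abstract."""
    grammar_notes: list[str] = field(default_factory=list)
    banned_words: list[str] = field(default_factory=list)  # negative constraints
    romanization_rules: list[str] = field(default_factory=list)


def _bullets(items: list[str]) -> str:
    """Render a list of strings as a plain-text bullet list."""
    return "\n".join(f"- {item}" for item in items)


def build_prompt(spec: PromptSpec, user_message: str) -> str:
    """Assemble one structured prompt string from the components.

    The section headings below are assumptions made for illustration.
    """
    sections = [
        "Reply only in romanized Tulu.",
        "GRAMMAR NOTES:\n" + _bullets(spec.grammar_notes),
        "DO NOT USE these words from related languages:\n"
        + _bullets(spec.banned_words),
        "ROMANIZATION RULES:\n" + _bullets(spec.romanization_rules),
        "USER: " + user_message,
    ]
    return "\n\n".join(sections)


def contamination_rate(reply: str, banned_words: list[str]) -> float:
    """Fraction of whitespace-separated tokens found in the banned list.

    Assumed metric: a simple case-insensitive token match after stripping
    common punctuation. The paper's own definition may differ.
    """
    tokens = [t.strip(".,!?;:\"'").lower() for t in reply.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    banned = {w.lower() for w in banned_words}
    return sum(t in banned for t in tokens) / len(tokens)
```

In use, the string returned by `build_prompt` would be sent to whichever model is under test, and `contamination_rate` would be applied to each reply with a word list drawn from the related languages. Placeholder tokens are enough to exercise the functions; real word lists would need to come from native-speaker resources.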
