Can Large Language Models Translate Spoken-Only Languages through International Phonetic Transcription?

Jiale Chen; Xuelian Dong; Qihao Yang; Wenxiu Xie; Tianyong Hao

2025 EMNLP EMNLP 2025

Can Large Language Models Translate Spoken-Only Languages through International Phonetic Transcription?

Abstract

AbstractSpoken-only languages are languages without a writing system. They remain excluded from modern Natural Language Processing (NLP) advancements like Large Language Models (LLMs) due to their lack of textual data. Existing NLP research focuses primarily on high-resource or written low-resource languages, leaving spoken-only languages critically underexplored. As a popular NLP paradigm, LLMs have demonstrated strong few-shot and cross-lingual generalization abilities, making them a promising solution for understanding and translating spoken-only languages. In this paper, we investigate how LLMs can translate spoken-only languages into high-resource languages by leveraging international phonetic transcription as an intermediate representation. We propose UNILANG, a unified language understanding framework that learns to translate spoken-only languages via in-context learning. Through automatic dictionary construction and knowledge retrieval, UNILANG equips LLMs with more fine-grained knowledge for improving word-level semantic alignment. To support this study, we introduce the SOLAN dataset, which consists of Bai (a spoken-only language) and its corresponding translations in a high-resource language. A series of experiments demonstrates the effectiveness of UNILANG in translating spoken-only languages, potentially contributing to the preservation of linguistic and cultural diversity. Our dataset and code will be publicly released.

❓ The Questioner

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Natural Language Processing

🧭 Keyword Pioneer — international phonetic transcription

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jiale Chen , Xuelian Dong , Qihao Yang , Wenxiu Xie , Tianyong Hao

Topics

Natural Language Processing > Applications > Machine Translation Natural Language Processing > Resources & Methods > Multilingual NLP Artificial Intelligence > Core AI > Large Language Models Natural Language Processing > Generation > Machine Translation Deep Learning > Models > Large Language Models

Keywords

machine translation in-context learning cross-lingual transfer low-resource language spoken language phonetic transcription language preservation large language model international phonetic transcription

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025