Octopus: Towards Building the Arabic Speech LLM Suite

Sara Althubaiti; Vasista Sai Lodagala; Tjad Clark; Yousseif Ahmed Elshahawy; Daniel Izham; Abdullah Alrajeh; Aljawahrah Bin Tamran; Ahmed Ali

EMNLP 2025

Abstract

We present Octopus, a first family of modular speech-language models designed for Arabic-English ASR, dialect identification, and speech translation. Built on Whisper-V3 and enhanced with large language models such as ALLaM, LLaMA, and DeepSeek, Octopus bridges speech and text through a lightweight projection layer and a Q-Former. To broaden its scope beyond speech, Octopus integrates BEATs, a general-purpose audio encoder, allowing it to understand both linguistic content and acoustic events. Despite its simplicity, this dual-encoder design supports robust performance across multilingual and code-switched scenarios. We also introduce TinyOctopus, a distilled variant using smaller models (Distil-Whisper + LLaMA3-1B / DeepSeek-1.5B), which achieves competitive results with just a fraction of the parameters. Fine-tuning on synthetic code-switched data further boosts its performance. Octopus demonstrates the power of compact, extensible architectures in Arabic-centric speech modeling and sets the stage for unified multilingual audio-language understanding.
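The dual-encoder bridge described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the two encoders' outputs are combined or the order of the Q-Former and the projection layer. The sketch assumes that the speech and audio encoders emit time-aligned frames, that the frames are concatenated per time step, that a fixed set of query vectors pools them with a single cross-attention step (standing in for a Q-Former), and that a linear projection then maps each pooled vector to the LLM embedding size. All names, dimensions, and random weights are hypothetical.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frames):
    """Each query attends over all frames with scaled dot-product attention.

    Simplification: no learned key/value maps, one head, one layer.
    """
    dim = len(frames[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in frames]
        weights = softmax(scores)
        pooled.append([sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)])
    return pooled


def random_matrix(rows, cols, rng):
    """Stand-in for encoder outputs or learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def bridge(speech_frames, audio_frames, queries, proj):
    """Fuse two encoders' frames, pool with queries, project to LLM width."""
    # Assumption: both encoders yield the same number of frames.
    fused = [s + a for s, a in zip(speech_frames, audio_frames)]  # per-frame concat
    pooled = cross_attend(queries, fused)
    return [matvec(proj, p) for p in pooled]


# Hypothetical toy sizes; real encoders and LLMs use far larger dimensions.
T, D_SPEECH, D_AUDIO, N_QUERIES, D_LLM = 6, 4, 3, 2, 5
rng = random.Random(0)
speech = random_matrix(T, D_SPEECH, rng)   # stands in for speech-encoder output
audio = random_matrix(T, D_AUDIO, rng)     # stands in for audio-encoder output
queries = random_matrix(N_QUERIES, D_SPEECH + D_AUDIO, rng)
proj = random_matrix(D_LLM, D_SPEECH + D_AUDIO, rng)

tokens = bridge(speech, audio, queries, proj)
print(len(tokens), len(tokens[0]))  # prints "2 5"
```

The point of the sketch is the shape contract: however many frames the encoders produce, the LLM receives a fixed number of vectors (one per query) at its own embedding width, which is what lets the encoders and the language model be swapped independently.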

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Model Merging
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

model distillation, speech recognition, multimodal learning, large language model
