Beyond WER: Probing Whisper’s Sub‐token Decoder Across Diverse Language Resource Levels

Siyu Liang; Nicolas Ballier; Gina-Anne Levow; Richard Wright

EMNLP 2025


Abstract

While large multilingual automatic speech recognition (ASR) models achieve remarkable performance, the internal mechanisms of the end-to-end pipeline, particularly concerning fairness and efficacy across languages, remain underexplored. This paper introduces a fine-grained analysis of Whisper’s multilingual decoder, examining its sub-token hypotheses during transcription across languages with various resource levels. Our method traces the beam search path, capturing sub-token guesses and their associated probabilities. Results reveal that higher-resource languages benefit from a higher likelihood of the correct token being top-ranked, greater confidence, lower predictive entropy, and more diverse alternative candidates. Lower-resource languages fare worse on these metrics, but our PCA and t-SNE analyses also show that they exhibit distinct clustering patterns in sub-token usage, sometimes influenced by typology. This sub-token probing uncovers systematic decoding disparities masked by aggregate error rates and points towards targeted interventions to ameliorate the imbalanced development of speech technology.
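The per-step quantities named in the abstract (rank of the correct token, confidence, predictive entropy) can be illustrated with a minimal sketch. This is not the authors' code: the function names `softmax` and `step_metrics`, the four-entry toy vocabulary, and the logit values are all assumptions made for illustration, and confidence is taken here to mean the top-1 probability, which the abstract does not define.

```python
import math

def softmax(logits):
    """Convert raw decoder logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step_metrics(logits, correct_id):
    """Compute illustrative metrics for one decoding step.

    logits     : hypothetical scores over a toy sub-token vocabulary
    correct_id : index of the reference sub-token (assumed known)
    """
    probs = softmax(logits)
    # Rank 1 means the correct sub-token is the top-ranked candidate.
    rank = 1 + sum(1 for p in probs if p > probs[correct_id])
    # Confidence is assumed here to be the top-1 probability.
    confidence = max(probs)
    # Predictive entropy in nats: H = -sum(p * log p).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return {"rank": rank, "confidence": confidence, "entropy": entropy}

# Toy inputs (made up, not data from the paper):
# a peaked distribution versus a nearly flat one.
peaked = step_metrics([5.0, 1.0, 0.5, 0.1], correct_id=0)
flat = step_metrics([1.0, 1.1, 0.9, 1.0], correct_id=0)
```

In this toy setup the peaked distribution ranks the correct sub-token first with high confidence and low entropy, while the nearly flat one does not, which is the kind of contrast the abstract reports between higher- and lower-resource languages.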



Topics

Machine Learning > Application Areas > Domain Adaptation
Natural Language Processing > Resources & Methods > Multilingual NLP
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Recognition > Speech Recognition
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Models > Transformers

Keywords

transformer architecture, model probing, automatic speech recognition, beam search, multilingual model, multilingual speech recognition, speech technology, language resource, sub-token analysis, decoder probing, predictive entropy
