PronScribe: Highly Accurate Multimodal Phonemic Transcription From Speech and Text

Yang Yu; Matthew Perez; Ankur Bapna; Fadi Haik; Siamak Tazari; Yu Zhang

INTERSPEECH 2023

Abstract

We present PronScribe, a novel method for phonemic transcription from speech and text input, based on careful fine-tuning and adaptation of a massive, multilingual, multimodal speech-text pretrained model. We show that our model can phonemically transcribe the pronunciations of full utterances with accurate word boundaries in a variety of languages covering diverse phonological phenomena, achieving phoneme error rates in the vicinity of 1-2%, which is comparable to human transcribers. We also show that PronScribe can learn this task effectively from relatively little training data, making it attractive even in low-resource settings. It learns from text and speech simultaneously in a coherent way, and it outperforms previous models that use speech, text, or both. Additionally, the model's good transfer learning characteristics in multilingual settings can effectively boost performance for lower-resourced languages.
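
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a phoneme error rate is commonly computed: the Levenshtein (edit) distance between a reference and a hypothesis phoneme sequence, divided by the reference length. This is the conventional definition and an assumption here; the abstract does not state the exact scoring procedure (for example, how word boundaries are counted), and the function names and sample phoneme sequences are illustrative only.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one substituted vowel in a four-phoneme word.
    reference = ["h", "ə", "l", "oʊ"]
    hypothesis = ["h", "ɛ", "l", "oʊ"]
    print(phoneme_error_rate(reference, hypothesis))
```

Under this definition, a rate of 1-2% corresponds to roughly one or two phoneme-level errors per hundred reference phonemes.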
