Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions

Murali Karthick Baskar; Andrew Rosenberg; Bhuvana Ramabhadran; Neeraj Gaur; Zhong Meng

INTERSPEECH 2024

Abstract

In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We have found that optimizing speech prefixes leads to better ASR performance, and we propose applying RNNT loss to perform speech prefix-tuning. This is a simple approach that does not increase the model complexity or alter the inference pipeline. We also propose language-based soft prompting to further improve performance with frozen LLMs. Empirical analysis on a real-time test set covering 10 Indic languages demonstrates that our proposed speech prefix-tuning yields improvements with both frozen and fine-tuned LLMs. Our recognition results, averaged over the 10 Indic languages, show that the proposed prefix-tuning with RNNT loss results in a 12% relative improvement in WER over the baseline with a fine-tuned LLM. Our proposed approaches with the frozen LLM lead to a 31% relative improvement over basic soft-prompting prefixLM.
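The improvements in the abstract are reported as relative WER reductions. As a minimal sketch of that arithmetic, assuming the usual definition (baseline WER minus new WER, divided by baseline WER), the computation looks like the following. The function name and the WER values are hypothetical and chosen only to illustrate the calculation; they are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (in percent), used only to illustrate the arithmetic.
example = relative_wer_improvement(baseline_wer=25.0, new_wer=22.0)
print(f"{example:.0%} relative improvement")
```

Under this definition, a relative improvement is always smaller in absolute WER points than the percentage suggests when the baseline WER is below 100%.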

Topics

Machine Learning > Optimization & Theory > Neural Network Optimization
Machine Learning > Application Areas > Efficient Computing

Keywords

automatic speech recognition; connectionist temporal classification; soft prompting; large language model; speech prefix-tuning
