Online Knowledge Distillation of Decoder-Only Large Language Models for Efficient Speech Recognition

Jeehye Lee; Hyeji Seo

INTERSPEECH 2024

Abstract

Large language models (LLMs), which show promising performance in generation tasks, have proven applicable to a wide range of tasks. Although several approaches adapt LLMs as decoders for speech recognition, they can slow down inference, which is an important issue for product-level systems. To address this problem, we introduce online knowledge distillation methods that transfer information from a decoder-only LLM to a more compact Transformer decoder during the training phase. Applying our proposed methods to a multilingual low-resource dataset, we achieved an 8.2% relative character error rate (CER) reduction compared to the LLM decoder model at a much lower inference cost, and a 34.7% relative CER reduction compared to the attention-based encoder-decoder (AED) model. Furthermore, we obtained a 14.9% relative CER reduction at the same inference cost on a general Korean dataset.
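The abstract does not state the distillation objective, so the following is only a minimal sketch of generic logit-level knowledge distillation, not the authors' method. It assumes the teacher (the LLM decoder) and the student (the compact Transformer decoder) share one output vocabulary, and that the loss mixes cross-entropy on the reference tokens with a temperature-scaled KL term. The names `distillation_loss`, `alpha`, and `temperature` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one token position's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(probs, target, eps=1e-12):
    """Negative log-likelihood of the reference token index."""
    return -math.log(probs[target] + eps)


def distillation_loss(teacher_logits, student_logits, targets,
                      alpha=0.5, temperature=2.0):
    """Mean over token positions of alpha * CE + (1 - alpha) * T^2 * KL.

    Assumption: both decoders emit logits over the same vocabulary.
    alpha and temperature are illustrative hyperparameters, not values
    taken from the paper.
    """
    total = 0.0
    for t_logits, s_logits, target in zip(teacher_logits, student_logits, targets):
        # Hard-label term: student prediction against the reference token.
        ce = cross_entropy(softmax(s_logits), target)
        # Soft-label term: student distribution pulled toward the teacher's.
        kd = kl_divergence(softmax(t_logits, temperature),
                           softmax(s_logits, temperature))
        total += alpha * ce + (1.0 - alpha) * temperature ** 2 * kd
    return total / len(targets)


# Toy example: two token positions over a three-symbol vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student = [[1.0, 0.8, -0.5], [0.0, 0.9, 0.7]]
loss = distillation_loss(teacher, student, targets=[0, 1])
```

In an online setup of the kind the abstract describes, the teacher logits would presumably be produced by the LLM decoder's forward pass within the same training step rather than precomputed and stored; at inference time only the compact student decoder would be needed.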

Authors

Jeehye Lee, Hyeji Seo

Topics

Machine Learning > Application Areas > Knowledge Distillation

Natural Language Processing > Resources & Methods > Large Language Models

Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

model compression, knowledge distillation, automatic speech recognition, character error rate, decoder-only architecture, large language model
