Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR

Shaojun Li; Daimeng Wei; Hengchao Shang; Jiaxin Guo; Zongyao Li; Zhanglin Wu; Zhiqiang Rao; Yuanchang Luo; Xianghui He; Hao Yang

INTERSPEECH 2024

Abstract

Despite recent improvements in End-to-End Automatic Speech Recognition (E2E ASR) systems, performance can degrade when vocal characteristics differ between training and test data, particularly when little adaptation data is available for the target speaker. We propose a novel speaker adaptation approach, Speaker-Smoothed kNN, which uses k-Nearest Neighbors (kNN) retrieval to improve model output by finding correctly pronounced tokens in a pre-built datastore during decoding. In addition, we use x-vectors to dynamically adjust the kNN interpolation parameters, which mitigates data sparsity. We validate the approach on the KeSpeech and MagicData corpora under in-domain and all-domain settings. Our method consistently performs comparably to fine-tuning, without the performance degradation that fine-tuning suffers when the speaker changes. Furthermore, in the all-domain setting, our method achieves state-of-the-art results, reducing the character error rate (CER) in both single-speaker and multi-speaker test scenarios.
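The retrieval-and-interpolation idea in the abstract can be illustrated with a minimal sketch. It assumes the common kNN interpolation form, p = lam * p_kNN + (1 - lam) * p_model, which the abstract does not spell out. The function names, the softmax over negative squared distances, and the rule that scales the weight by x-vector cosine similarity are all hypothetical stand-ins, not the authors' formulation.

```python
import math


def knn_distribution(query, datastore, vocab_size, k=2, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over tokens.

    datastore is a list of (key_vector, token_id) pairs (assumed layout).
    """
    # Squared Euclidean distance from the query to every stored key.
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    )[:k]
    # Softmax over negative distances: closer neighbors get more mass.
    weights = [math.exp(-dist / temperature) for dist, _ in scored]
    total = sum(weights)
    probs = [0.0] * vocab_size
    for weight, (_, token) in zip(weights, scored):
        probs[token] += weight / total
    return probs


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def speaker_weight(test_xvector, store_xvector, max_lambda=0.5):
    """Hypothetical rule: trust retrieval more when the speakers are similar."""
    return max_lambda * max(0.0, cosine(test_xvector, store_xvector))


def interpolate(model_probs, knn_probs, lam):
    """Mix the model distribution with the retrieved one."""
    return [lam * pk + (1.0 - lam) * pm for pm, pk in zip(model_probs, knn_probs)]


# Toy usage: three stored decoder states, a vocabulary of three tokens.
datastore = [([0.0, 0.0], 0), ([1.0, 1.0], 1), ([5.0, 5.0], 2)]
knn_probs = knn_distribution([0.0, 0.0], datastore, vocab_size=3, k=2)
lam = speaker_weight([1.0, 0.0], [1.0, 0.0])
mixed = interpolate([0.2, 0.7, 0.1], knn_probs, lam)
```

In this toy setup the retrieved neighbors favor token 0, so interpolation shifts probability mass toward it. When the test speaker's x-vector is dissimilar to the datastore speaker's, the weight shrinks toward zero and the output falls back to the unadapted model.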

Topics

Machine Learning > Core Methods > Metric Learning
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

k-nearest neighbor, speaker adaptation, end-to-end speech recognition, x-vector embedding, token retrieval
