Cross-Modality Diffusion Modeling and Sampling for Speech Recognition

Chia-Kai Yeh; Chih-Chun Chen; Ching-Hsien Hsu; Jen-Tzung Chien

INTERSPEECH 2024

Abstract

The diffusion model excels as a generative model for continuous data within a single modality. To extend its effectiveness to speech recognition, where continuous speech frames serve as the condition for generating discrete word tokens, building a conditional diffusion model over a discrete state space becomes crucial. This paper introduces a non-autoregressive discrete diffusion model that generates the word string corresponding to a speech signal in parallel through iterative diffusion steps. An acoustic transformer encoder extracts the speech representation, which serves as the condition for a denoising transformer decoder that predicts the whole discrete sequence. To reduce redundancy in cross-modality diffusion, an additional feature decorrelation objective is integrated during optimization. The paper further reduces inference time by using a fast sampling approach. Experiments on speech recognition illustrate the merit of the proposed method.
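The abstract does not give the transition matrices, the noise schedule, or the fast sampler, so the following is only a minimal sketch of one common way to realize non-autoregressive discrete diffusion: an absorbing-state ("mask") forward process and a confidence-based parallel reverse process. Every name here (`MASK`, `corrupt`, `sample`, `denoise`) is hypothetical and not taken from the paper. The `denoise` callable stands in for the denoising transformer decoder and is assumed to already be conditioned on the acoustic encoder output.

```python
import random

MASK = -1  # hypothetical id for the absorbing "mask" state (assumption)


def corrupt(tokens, t, num_steps, rng):
    """Forward process sketch: mask each token independently with
    probability t / num_steps (a linear schedule, chosen for illustration)."""
    p = t / num_steps
    return [MASK if rng.random() < p else tok for tok in tokens]


def sample(denoise, length, num_steps):
    """Reverse process sketch: predict every position in parallel, then
    commit only the most confident masked positions at each step.

    `denoise(seq)` is a stand-in for a speech-conditioned decoder and must
    return one (token, confidence) pair per position.
    """
    seq = [MASK] * length
    for step in range(1, num_steps + 1):
        preds = denoise(seq)
        # Total number of positions that should be committed after this step.
        target = (length * step) // num_steps
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        n_commit = target - (length - len(masked))
        # Commit the highest-confidence positions that are still masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            seq[i] = preds[i][0]
    return seq


if __name__ == "__main__":
    reference = [5, 6, 7, 8]
    # Toy denoiser that always predicts the reference tokens.
    toy = lambda seq: [(tok, 1.0 / (i + 1)) for i, tok in enumerate(reference)]
    print(corrupt(reference, t=2, num_steps=4, rng=random.Random(0)))
    print(sample(toy, length=4, num_steps=2))
```

In this sketch, using fewer reverse steps commits more tokens per step, which is one simple way to trade accuracy for inference time; the paper's own fast sampling approach may work differently.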

Topics

Deep Learning > Models > Diffusion Models
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

speech recognition, diffusion model, discrete diffusion, non-autoregressive generation, acoustic transformer encoder
