Locally Aligned Rectified Flow Model for Speech Enhancement Towards Single-Step Diffusion

Zhengxiao Li; Nakamasa Inoue

INTERSPEECH 2024

Abstract

Diffusion models based on stochastic differential equations have been shown to be effective in speech enhancement, the task of recovering clean speech signals from noisy speech signals. However, these models are limited by computational complexity, mainly due to the large number of function evaluations required in the reverse diffusion process. To address this limitation, we propose the locally aligned rectified flow (LARF) model, a diffusion model based on ordinary differential equations that learns a transport mapping between the distributions of clean and noisy speech features. By introducing global and local flow matching losses, LARF constrains the transport mapping to be as straight as possible, which reduces the number of function evaluations. In experiments, we demonstrate the effectiveness of LARF on two speech enhancement datasets: WSJ0-CHiME3 and VoiceBank-DEMAND. On WSJ0-CHiME3, LARF achieved a PESQ of 2.95 and an SI-SDR of 19.3 dB with a single step.
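To make concrete why a straight transport mapping reduces the number of function evaluations, the following is a minimal sketch of the generic rectified-flow idea, not the LARF implementation. It assumes the flow runs from a noisy feature at t = 0 to a clean feature at t = 1, uses toy NumPy vectors in place of speech features, and replaces the trained network with a hypothetical oracle velocity field. The global and local flow matching losses of LARF are not reproduced here; only the standard flow matching objective is shown.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point at time t on the straight line from x0 (t = 0) to x1 (t = 1)."""
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Generic flow matching objective: the predicted velocity at the
    interpolated point should equal the constant displacement x1 - x0."""
    xt = interpolate(x0, x1, t)
    target = x1 - x0
    return float(np.mean((velocity_fn(xt, t) - target) ** 2))

def euler_sample(velocity_fn, x0, num_steps=1):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps.
    Each step costs one function evaluation of velocity_fn."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Toy stand-ins for a (noisy, clean) feature pair; assumptions for illustration.
rng = np.random.default_rng(0)
clean = rng.normal(size=8)
noisy = clean + 0.5 * rng.normal(size=8)

# Hypothetical oracle for a perfectly straight flow: constant velocity.
def oracle_velocity(x, t):
    return clean - noisy

loss = flow_matching_loss(oracle_velocity, noisy, clean, t=0.3)
one_step = euler_sample(oracle_velocity, noisy, num_steps=1)
ten_steps = euler_sample(oracle_velocity, noisy, num_steps=10)
```

When the velocity field is constant along the path, one Euler step and ten Euler steps reach the same endpoint, so the extra function evaluations buy nothing. A learned field is only approximately straight, which is why the straightness of the mapping governs how few steps suffice.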

Topics

Machine Learning > Optimization & Theory > Optimization
Deep Learning > Models > Diffusion Models

Keywords

flow matching, speech enhancement, diffusion model, ordinary differential equation, rectified flow, single-step diffusion
