DGSRN: Noise-Robust Speech Recognition Method with Dual-Path Gated Spectral Refinement Network

Wenjun Wang; Shangbin Mo; Ling Dong; Zhengtao Yu; Junjun Guo; Yuxin Huang

INTERSPEECH 2024

Abstract

Advances in speech recognition have brought significant progress in recognizing clean speech, but challenges persist in real-world noisy environments. To address the speech distortion and residual noise found in signals processed by speech enhancement models, we propose a noise-robust speech recognition method based on the Dual-Path Gated Spectral Refinement Network (DGSRN). We construct a single-channel speech enhancement model based on dense time-frequency convolutional networks for the first stage of noise suppression. The Dual-Path Gated Spectral Refinement Network is then designed to extract useful features from the estimated noise to improve speech quality. Multi-task joint training is conducted using a weighted speech distortion loss function. Experimental results demonstrate that, compared to traditional joint training approaches, DGSRN achieves a 12.41% reduction in Character Error Rate and mitigates the mismatch in performance across evaluation metrics.
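The abstract does not give the form of the weighted speech distortion loss or how it is combined with the recognition loss. As a rough illustration only, the sketch below assumes one common formulation in which a mask (gain) is penalized separately for distorting speech and for passing residual noise, with a weight `alpha` trading the two off, and a second weight `lam` adding the enhancement term to the recognition loss. The function names, `alpha`, `lam`, and the per-bin magnitude inputs are all hypothetical and are not taken from the paper.

```python
def weighted_distortion_loss(gain, speech_mag, noise_mag, alpha=0.5):
    """Assumed form: alpha * speech distortion + (1 - alpha) * residual noise.

    gain, speech_mag, noise_mag are equal-length lists of per-bin values
    (hypothetical inputs; a real system would use spectrogram tensors).
    """
    n = len(gain)
    # Speech distortion: how far the masked speech is from the clean speech.
    distortion = sum((s - g * s) ** 2 for g, s in zip(gain, speech_mag)) / n
    # Residual noise: how much noise energy the mask lets through.
    residual = sum((g * v) ** 2 for g, v in zip(gain, noise_mag)) / n
    return alpha * distortion + (1.0 - alpha) * residual


def joint_loss(asr_loss, enhancement_loss, lam=0.3):
    """Assumed multi-task objective: recognition loss plus weighted enhancement loss."""
    return asr_loss + lam * enhancement_loss


if __name__ == "__main__":
    speech = [2.0, 4.0]
    noise = [1.0, 3.0]
    pass_all = weighted_distortion_loss([1.0, 1.0], speech, noise, alpha=0.5)
    block_all = weighted_distortion_loss([0.0, 0.0], speech, noise, alpha=0.5)
    print(pass_all, block_all, joint_loss(1.0, pass_all, lam=0.4))
```

With an all-pass gain only the residual-noise term contributes, and with an all-zero gain only the speech-distortion term does, so raising `alpha` in this sketch pushes the mask toward preserving speech at the cost of leaving more noise.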

Authors

Wenjun Wang, Shangbin Mo, Ling Dong, Zhengtao Yu, Junjun Guo, Yuxin Huang

Topics

Deep Learning > Architectures > Neural Networks
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Synthesis > Speech Enhancement

Keywords

speech enhancement, dual-path network, character error rate, noise-robust speech recognition, spectral refinement, multi-task joint training
