TfCleanformer: A streaming, array-agnostic, full- and sub-band modeling front-end for robust ASR

Jens Heitkaemper; Joe Caroselli; Arun Narayanan; Nathan Howard

INTERSPEECH 2024

Abstract

Multiple recent publications have demonstrated the benefits of neural-network-based enhancement in the time-frequency domain. This paper builds on those findings to improve a recently published streaming, array-agnostic, multi-channel enhancement system called Cleanformer. On a source separation task, the proposed streaming enhancement system outperforms Cleanformer and achieves results competitive with a non-causal state-of-the-art model. The presented model also improves upon Cleanformer's enhancement results in multiple challenging environments without introducing further latency. A short ablation study evaluates the contribution of the proposed changes to the improved performance.

Topics

Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Synthesis > Speech Enhancement

Keywords

automatic speech recognition, speech enhancement, streaming model, time-frequency domain, neural network
