SEQ-former: A context-enhanced and efficient automatic speech recognition framework

Qinglin Meng; Min Liu; Kaixun Huang; Kun Wei; Lei Xie; Zongfeng Quan; Weihong Deng; Quan Lu; Ning Jiang; Guoqing Zhao

INTERSPEECH 2024

Abstract

Contextual information is crucial for automatic speech recognition (ASR), and using it effectively can improve the accuracy of ASR systems. To improve the model's ability to capture this information, we propose a novel ASR framework called SEQ-former, which emphasizes simplicity, efficiency, and quickness. We incorporate a Prediction Decoder Network and a Shared Prediction Decoder Network to enhance contextual capabilities. To further increase efficiency, we use intermediate CTC and CTC Spike Reduce Methods to guide attention masks and reduce redundant peaks. Our approach achieves state-of-the-art performance on the AiShell-1 dataset, improves decoding efficiency, and delivers competitive results on LibriSpeech. Additionally, on 11,000 hours of private data, it yields a 6.3% improvement over Efficient Conformer.
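The abstract does not spell out how the CTC Spike Reduce Methods work. As a rough illustration of the general idea of using CTC output to discard redundant frames, the sketch below keeps only the first frame of each non-blank run in a greedy CTC alignment. This is an assumption made for illustration, not the authors' method: the function name `ctc_keep_indices`, the blank index `0`, and the selection rule are all hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_keep_indices(
    posteriors: Sequence[Sequence[float]], blank: int = BLANK
) -> List[int]:
    """Return the indices of frames to keep.

    A frame is kept when its most likely label is non-blank and differs
    from the most likely label of the previous frame, i.e. it is the
    first frame of a non-blank run. Blank frames and repeated frames
    are dropped.
    """
    keep: List[int] = []
    prev = blank
    for t, frame in enumerate(posteriors):
        # Greedy (argmax) label for this frame.
        label = max(range(len(frame)), key=frame.__getitem__)
        if label != blank and label != prev:
            keep.append(t)
        prev = label
    return keep


# Toy posteriors over a 3-symbol vocabulary (blank, "a", "b").
# Greedy labels per frame: blank, a, a, blank, a, b
toy = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
]
kept = ctc_keep_indices(toy)
```

In this toy case six frames shrink to three, so any later layer that attends only over the kept frames processes a shorter sequence. How SEQ-former actually selects frames and builds its attention masks is not described in the abstract.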

Topics

Machine Learning > Application Areas > Efficient Computing
Deep Learning > Architectures > Transformers
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

contextual information; automatic speech recognition; connectionist temporal classification; prediction decoder; decoding efficiency; efficient conformer

Related papers

Reshape Dimensions Network for Speaker Recognition (2024)

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification (2024)

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch (2024)

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions (2024)

K-means and hierarchical clustering of f0 contours (2024)