2024 INTERSPEECH INTERSPEECH 2024

SEQ-former: A context-enhanced and efficient automatic speech recognition framework

Abstract

Contextual information is crucial for automatic speech recognition (ASR). Effective utilization of contextual information can improve the accuracy of ASR systems. To improve the model's ability to capture this information, we propose a novel ASR framework called SEQ-former, emphasizing simplicity, efficiency, and quickness. We incorporate a Prediction Decoder Network and a Shared Prediction Decoder Network to enhance contextual capabilities. To further increase efficiency, we use intermediate CTC and CTC Spike Reduce Methods to guide attention masks and reduce redundant peaks. Our approach demonstrates state-of-the-art performance on the AiShell-1 dataset, improves decoding efficiency, and delivers competitive results on LibriSpeech. Additionally, it optimizes 6.3% over 11,000 hours of private data compared to Efficient Conformer.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Speech & Audio
🧭 Keyword Pioneer — prediction decoder
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Deep Learning, Interdisciplinary, Machine Learning, Natural Language Processing, Speech & Audio