2022 INTERSPEECH INTERSPEECH 2022

Deep Sparse Conformer for Speech Recognition

Abstract

Conformer has achieved impressive results in Automatic Speech Recognition (ASR) by leveraging transformer's capturing of content-based global interactions and convolutional neural network's exploiting of local features. In Conformer, two macaron-like feed-forward layers with half-step residual connections sandwiches the multi-head self-attention and convolution modules followed by a post layer normalization. We improve Conformer's long-sequence representation ability in two directions, sparser and deeper. We adapt a sparse self-attention mechanism with O(LlogL) in time complexity and memory usage. A deep normalization strategy is utilized when performing residual connections to ensure our training of hundred-level Conformer blocks. On the Japanese CSJ-500h dataset, this deep sparse Conformer achieves respectively CERs of 5.52%, 4.03% and 4.50% on the three evaluation sets and 4.16%, 2.84% and 3.20% when ensembling five deep sparse Conformer variants from 12 to 16, 17, 50, and finally 100 encoder layers.

🌉 Interdisciplinary Bridge — Deep Learning and Speech & Audio
🧭 Keyword Pioneer — deep normalization
🐣 Hot Topic Early Bird — feed-forward network
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors