MULTI-CONVFORMER: Extending Conformer with Multiple Convolution Kernels

Darshan Prabhu; Yifan Peng; Preethi Jyothi; Shinji Watanabe

2024 INTERSPEECH INTERSPEECH 2024

MULTI-CONVFORMER: Extending Conformer with Multiple Convolution Kernels

Abstract

Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems due to their efficient modelling of local context. Notably, its use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been reexamined, altering the convolution module itself has been far less explored. Towards this, we introduce MULTI-CONVFORMER that uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This helps in improved modeling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient. We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.

🌉 Interdisciplinary Bridge — Deep Learning and Speech & Audio

🧭 Keyword Pioneer — local context

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Speech & Audio

Authors

Darshan Prabhu , Yifan Peng , Preethi Jyothi , Shinji Watanabe

Topics

Deep Learning > Architectures > Neural Networks Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

automatic speech recognition parameter efficiency word error rate convolution kernel conformer model end-to-end asr local context

Download PDF

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024