Convolution-Augmented Parameter-Efficient Fine-Tuning for Speech Recognition

Kwangyoun Kim; Suwon Shon; Yi-Te Hsu; Prashant Sridhar; Karen Livescu; Shinji Watanabe

INTERSPEECH 2024

Abstract

Parameter-efficient fine-tuning (PEFT) methods, which train only part of a model's parameters, yield efficient and effective models. Bottleneck approaches, such as adapters and low-rank adaptation (LoRA), have proven beneficial in numerous studies and are widely used. In this work, we propose and investigate an enhanced PEFT method that adds convolution to linear projection-based bottleneck approaches. We experiment with HuBERT, a representative speech model pre-trained with self-supervised learning, and fine-tune it for automatic speech recognition (ASR) to examine how the proposed PEFT method affects training and inference. We demonstrate consistent performance improvements with a minimal increase in parameters and computational complexity.
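The idea of adding convolution to a linear bottleneck can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say where the convolution sits or how it is configured, so the placement (a depthwise 1-D convolution over time inside a LoRA-style bottleneck), the identity kernel initialization, and all names (`depthwise_conv1d`, `conv_lora_forward`, `identity_kernel`, `extra_params`) are assumptions made for illustration.

```python
import numpy as np


def depthwise_conv1d(h, kernel):
    """Same-padded depthwise 1-D convolution over the time axis.

    h:      (T, r) bottleneck features, one row per frame
    kernel: (k, r) one length-k filter per bottleneck channel (k odd)
    """
    k = kernel.shape[0]
    pad = k // 2
    hp = np.pad(h, ((pad, k - 1 - pad), (0, 0)))  # zero-pad along time only
    return np.stack(
        [(hp[t:t + k] * kernel).sum(axis=0) for t in range(h.shape[0])]
    )


def identity_kernel(k, r):
    """Hypothetical init: centre tap = 1, so the convolution starts as a no-op."""
    kernel = np.zeros((k, r))
    kernel[k // 2] = 1.0
    return kernel


def conv_lora_forward(x, W, A, B, kernel):
    """Frozen projection plus a convolution-augmented low-rank update.

    x: (T, d_in) input frames      W: (d_in, d_out) frozen weight
    A: (d_in, r) down-projection   B: (r, d_out) up-projection
    Assumption: the convolution is applied between A and B.
    """
    h = x @ A                        # project down to the bottleneck
    h = depthwise_conv1d(h, kernel)  # mix neighbouring frames per channel
    return x @ W + h @ B             # project up and add to the frozen path


def extra_params(d_in, d_out, r, k):
    """Trainable parameters added by this sketch (A, B, and the kernel)."""
    return d_in * r + r * d_out + k * r
```

In this sketch the convolution contributes only `k * r` extra weights on top of the two projections, which matches the abstract's claim of a minimal parameter increase in spirit; the actual counts and configuration in the paper may differ. With the identity kernel the layer reduces to an ordinary LoRA update, and with `B` set to zero it reproduces the frozen projection exactly.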

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Application Areas > Efficient Computing
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

self-supervised learning, automatic speech recognition, parameter-efficient fine-tuning, convolution augmentation, bottleneck adapter
