On the (In)Efficiency of Acoustic Feature Extractors for Self-Supervised Speech Representation Learning

Titouan Parcollet; Shucong Zhang; Rogier van Dalen; Alberto Gil C. P. Ramos; Sourav Bhattacharya

2023 INTERSPEECH INTERSPEECH 2023

On the (In)Efficiency of Acoustic Feature Extractors for Self-Supervised Speech Representation Learning

Abstract

Speech representations learned with self-supervised learning (SSL) have the potential to significantly improve the performance of a number of audio applications, especially when availability of labeled data from the deployment domain is limited. Despite their successes, SSL training methods are compute- and memory-heavy, and require large investments in computing infrastructure, thus putting it out of the reach of most institutions. Therefore, building efficient model architectures is essential for the wide-scale adoption of SSL in speech technologies. CNN-based Acoustic Feature Extractors (AFE), which are widely used as encoders of acoustic waveforms, remain one of the main efficiency bottlenecks. This work proposes replacing CNN-based AFEs with more efficient ones and demonstrates that SSL pre-training time and memory consumption can be reduced by a factor of two to three over existing methods while preserving performances in speech-, command-, and speaker-recognition tasks.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio