Scaling Effect of Self-Supervised Speech Models

Jie Pu; Yuguang Yang; Ruirui Li; Oguz Elibol; Jasha Droppo

INTERSPEECH 2021

Abstract

The success of modern deep learning systems is built on two cornerstones: massive amounts of annotated training data and advanced computational infrastructure to support large-scale computation. In recent years, the model size of state-of-the-art deep learning systems has increased rapidly, sometimes reaching billions of parameters. Herein we take a close look at this phenomenon and present an empirical study of the scaling effect of model size for self-supervised speech models. In particular, we investigate the quantitative relationship between model size and loss/accuracy performance on speech tasks. First, the power-law scaling property between the number of parameters and the L1 self-supervised loss is verified for speech models. Then the advantage of large speech models in learning effective speech representations is demonstrated in two downstream tasks: i) speaker recognition and ii) phoneme classification. Moreover, it is shown that the model size of self-supervised speech networks can compensate for the lack of annotation when there is insufficient training data.
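As a rough illustration of what verifying a power-law relationship between parameter count and loss can involve, the sketch below fits a line in log-log space. It assumes the functional form loss = a * N^(-alpha), which the abstract does not spell out. The function name fit_power_law and all numbers are hypothetical, and the data points are synthetic values generated from a known power law, not results from the paper.

```python
import math


def fit_power_law(num_params, losses):
    """Fit loss ~ a * N**(-alpha) by ordinary least squares in log-log space.

    Assumed functional form (not taken from the paper). Returns (a, alpha).
    """
    xs = [math.log(n) for n in num_params]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope and intercept of the best-fit line log(loss) = intercept + slope * log(N).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic example only: losses generated from a known power law,
# so the fit should recover the parameters used to generate them.
sizes = [1e6, 1e7, 1e8, 1e9]           # hypothetical parameter counts
true_a, true_alpha = 5.0, 0.1          # hypothetical constants
synthetic_losses = [true_a * n ** (-true_alpha) for n in sizes]

a, alpha = fit_power_law(sizes, synthetic_losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

On measured losses, points that fall close to a straight line on a log-log plot are the usual sign that a power law describes the trend; the fitted alpha then summarizes how quickly the loss falls as the model grows.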

Authors

Jie Pu, Yuguang Yang, Ruirui Li, Oguz Elibol, Jasha Droppo

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Deep Learning > Models > Generative Models
Speech & Audio > Recognition > Speaker Recognition
Machine Learning > Learning Types > Transfer Learning
Deep Learning > Learning Types > Representation Learning

Keywords

representation learning, self-supervised learning, speaker recognition, phoneme classification, model scaling, speech representation, power-law scaling
