Reduce, Reuse, Recycle: Is Perturbed Data Better than Other Language Augmentation for Low Resource Self-Supervised Speech Models

Asad Ullah; Alessandro Ragano; Andrew Hines

2024 INTERSPEECH INTERSPEECH 2024

Reduce, Reuse, Recycle: Is Perturbed Data Better than Other Language Augmentation for Low Resource Self-Supervised Speech Models

Abstract

Self-supervised representation learning (SSRL) has demonstrated superior performance than supervised models for tasks including phoneme recognition. Training SSRL models poses a challenge for low-resource languages where sufficient pre-training data may not be available. A common approach is cross-lingual pre-training. Instead, we propose to use audio augmentation techniques, namely: pitch variation, noise addition, accented target language and other language speech to pre-train SSRL models in a low resource condition and evaluate phoneme recognition. Our comparisons found that a combined synthetic augmentations (noise/pitch) strategy outperformed accent and language knowledge transfer. Furthermore, we examined the scaling factor of augmented data to achieve equivalent performance to model pre-trained with target domain speech. Our findings suggest that for resource-constrained languages, combined augmentations can be a viable option than other augmentations.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Asad Ullah , Alessandro Ragano , Andrew Hines

Topics

Artificial Intelligence > Learning Paradigms > Transfer Learning Machine Learning > Learning Types > Self-Supervised Learning Machine Learning > Application Areas > Data Augmentation Deep Learning > Techniques > Pretraining

Keywords

self-supervised learning phoneme recognition low-resource language cross-lingual pre-training speech model self-supervised representation learning pitch variation audio augmentation

Download PDF

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024