Zero-shot Out-of-domain is No Joke: Lessons Learned in the VoiceMOS 2023 MOS Prediction Challenge

Marie Kunešová; Jan Lehečka; Josef Michálek; Jindřich Matoušek; Jan Švec

2024 INTERSPEECH INTERSPEECH 2024

Zero-shot Out-of-domain is No Joke: Lessons Learned in the VoiceMOS 2023 MOS Prediction Challenge

Abstract

This paper describes our team’s experiences in the VoiceMOS Challenge 2023 - a challenge centered around the evaluation of the quality of synthetic or noisy speech. Inspired by our success with an ensemble approach in the first VoiceMOS Challenge in 2022, we submitted an ensemble of four models this time, based on wav2vec 2.0, QuartzNet, CNN-RNN, and LDNet. This was enough to win one of the two tracks we participated in (Track 1b). However, post-challenge analysis shows that only two of the models offer a meaningful contribution in any of the VoiceMOS 2023 tracks, while the other two only degrade the ensemble’s overall performance. On the other hand, post-challenge results on Track 2 (singing voice conversion data) surpassed all our expectations. In the paper, we explain how we tried to deal with the new zero-shot out-of-domain scenarios, analyze the results, and discuss the lessons learned.

🧭 Keyword Pioneer — speaker quality prediction

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Marie Kunešová , Jan Lehečka , Josef Michálek , Jindřich Matoušek , Jan Švec

Topics

Machine Learning > Learning Types > Zero-Shot Learning Machine Learning > Application Areas > Efficient Computing

Keywords

zero-shot learning ensemble learning speech synthesis out-of-domain detection ensemble method speech quality assessment speech quality mean opinion score speaker quality prediction

Download PDF

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024