INTERSPEECH 2024

A comparison of voice similarity through acoustics, human perception and deep neural network (DNN) speaker verification systems

Abstract

Voice similarity can be assessed through acoustic analysis, perceptual judgments by human listeners, and, more recently, automatic speaker verification systems. However, the similarity judgments derived from acoustics, listener perception, and deep neural network (DNN) based speaker verification systems have not yet been compared. This project fills this gap by comparing acoustic similarity scores generated from 24 acoustic dimensions, and verification scores generated by seven pretrained speaker verification models using the Wespeaker toolkit, to perceptual similarity assessed by human listeners in an AX discrimination task and a (dis)similarity rating task. Results suggest that verification similarities correlate with acoustic similarities, but not with human perceptual similarities once talker pair is controlled for, indicating that the correlation between listeners and speaker verification models holds at a gross phonetic level rather than a fine phonetic level.
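
The comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' pipeline: the abstract does not say how scores were computed, so cosine similarity stands in for a verification score, negative Euclidean distance over already normalized acoustic features stands in for acoustic similarity, and centering scores within each talker pair stands in for "controlling for talker pair". The function names and the toy numbers are hypothetical and are not results from the study.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity between two speaker embeddings (assumed scoring rule)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def acoustic_similarity(u, v):
    """Negative Euclidean distance between two normalized acoustic vectors
    (e.g. 24 dimensions); larger values mean more similar (assumed metric)."""
    return -math.dist(u, v)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def center_within_pair(scores, pair_ids):
    """Subtract each talker pair's mean score, keeping only within-pair variation."""
    grouped = defaultdict(list)
    for score, pair in zip(scores, pair_ids):
        grouped[pair].append(score)
    means = {pair: sum(vals) / len(vals) for pair, vals in grouped.items()}
    return [score - means[pair] for score, pair in zip(scores, pair_ids)]


# Made-up toy scores: three trials for each of two hypothetical talker pairs.
pair_ids = ["A", "A", "A", "B", "B", "B"]
model_scores = [0.90, 0.80, 0.85, 0.20, 0.30, 0.25]   # verification scores
listener_scores = [6.0, 7.0, 6.5, 2.0, 1.0, 1.5]      # similarity ratings

# Across pairs, the model and listeners agree on which pair sounds more similar.
raw_r = pearson(model_scores, listener_scores)

# Within a pair, trial-to-trial variation need not agree at all.
within_r = pearson(
    center_within_pair(model_scores, pair_ids),
    center_within_pair(listener_scores, pair_ids),
)

print(f"raw correlation: {raw_r:.2f}")
print(f"within-pair correlation: {within_r:.2f}")
```

In this toy case the raw correlation is high because both score sets separate pair A from pair B, while the within-pair correlation is not. That is the pattern the abstract describes as agreement at a gross phonetic level but not at a fine phonetic level.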
