A comparison of voice similarity through acoustics, human perception and deep neural network (DNN) speaker verification systems

Suyuan Liu; Molly Babel; Jian Zhu

INTERSPEECH 2024

Abstract

Voice similarity can be assessed through acoustic analysis, perceptual judgments by human listeners, and, more recently, automatic speaker verification systems. However, the similarity judgments derived from acoustics, listener perception, and deep neural network (DNN) based speaker verification systems have not yet been compared with one another. This project fills this gap by comparing acoustic similarity scores generated from 24 acoustic dimensions, and verification scores generated by seven pretrained speaker verification models from the Wespeaker toolkit, to perceptual similarity assessed by human listeners in an AX discrimination task and a (dis)similarity rating task. Results suggest that verification similarities correlate with acoustic similarities, but not with human perceptual similarities once talker pair is controlled for, indicating that the correlation between listeners and speaker verification models holds at a gross phonetic level rather than a fine phonetic level.
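The comparison described in the abstract can be illustrated with a minimal sketch. It assumes that a verification score is the cosine similarity between two speaker embeddings and that "controlling for talker pair" can be approximated by centering scores within each talker pair before correlating them. Both are simplifying assumptions made for illustration, not the authors' actual analysis. All data below are synthetic, and the helper names (`cosine_similarity`, `center_within_groups`, `pearson`) are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def center_within_groups(values, groups):
    """Subtract each group's mean, removing between-group differences."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    centered = values.copy()
    for g in np.unique(groups):
        mask = groups == g
        centered[mask] -= values[mask].mean()
    return centered


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic data (assumption): 6 talker pairs, 20 trials per pair.
rng = np.random.default_rng(0)
n_pairs, n_trials = 6, 20
pair_ids = np.repeat(np.arange(n_pairs), n_trials)

# Both measures share a pair-level effect (gross similarity of the two
# talkers) but have independent trial-level noise (fine detail).
pair_effect = np.linspace(-1.0, 1.0, n_pairs)[pair_ids]
model_score = pair_effect + 0.3 * rng.normal(size=pair_ids.size)
listener_rating = pair_effect + 0.3 * rng.normal(size=pair_ids.size)

# Correlation across all trials, driven largely by talker-pair differences.
raw_r = pearson(model_score, listener_rating)

# Correlation after removing talker-pair means from both measures.
within_r = pearson(
    center_within_groups(model_score, pair_ids),
    center_within_groups(listener_rating, pair_ids),
)

print(f"raw correlation:         {raw_r:.2f}")
print(f"within-pair correlation: {within_r:.2f}")
```

In this toy setup the two measures agree only through the shared pair-level effect, so the raw correlation is high while the within-pair correlation is close to zero. That is the qualitative pattern the abstract describes as agreement at a gross phonetic level but not at a fine phonetic level.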

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Classification
Interdisciplinary > Linguistics > Phonetics

Keywords

human perception, speaker verification, deep neural network, acoustic analysis, voice similarity
