2020 INTERSPEECH INTERSPEECH 2020

What Does an End-to-End Dialect Identification Model Learn About Non-Dialectal Information?

Abstract

An end-to-end dialect identification system generates the likelihood of each dialect, given a speech utterance. The performance relies on its capabilities to discriminate the acoustic properties between the different dialects, even though the input signal contains non-dialectal information such as speaker and channel. In this work, we study how non-dialectal information are encoded inside the end-to-end dialect identification model. We design several proxy tasks to understand the model’s ability to represent speech input for differentiating non-dialectal information — such as (a) gender and voice identity of speakers, (b) languages, (c) channel (recording and transmission) quality — and compare with dialectal information (i.e., predicting geographic region of the dialects). By analyzing non-dialectal representations from layers of an end-to-end Arabic dialect identification (ADI) model, we observe that the model retains gender and channel information throughout the network while learning a speaker-invariant representation. Our findings also suggest that the CNN layers of the end-to-end model mirror feature extractors capturing voice-specific information, while the fully-connected layers encode more dialectal information.

The Questioner
🧭 Keyword Pioneer — channel information
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Speech & Audio
🌉 Interdisciplinary Bridge — Artificial Intelligence and Interdisciplinary and Machine Learning and Speech & Audio