INTERSPEECH 2019

Temporally-Aware Acoustic Unit Discovery for Zerospeech 2019 Challenge

Abstract

Zero-resource speech processing efforts focus on unsupervised discovery of sub-word acoustic units. Common approaches work with spatial similarities between the acoustic frame representations within Bayesian or neural network-based frameworks. We propose two methods that use temporal proximity information in addition to acoustic similarity for clustering frames into acoustic units. The first approach uses a temporally biased self-organizing map (SOM) to discover such units. Since the SOM unit indices are correlated with (vector) spatial distance, we pool neighboring units and then train a recurrent neural network to predict each pooled unit. The second approach incorporates temporal awareness by training a recurrent sparse autoencoder, in which unsupervised clustering is done on the intermediate softmax layer. This network is then fine-tuned using aligned pairs of acoustically similar sequences obtained via unsupervised term discovery. Our approaches outperform the provided baseline system on the two main metrics of the Zerospeech 2019 challenge, ABX-discriminability and bitrate of the quantized embeddings, for both English and the surprise language. Furthermore, the temporal awareness and the post-filtering techniques adopted in this work improved the continuity of the decoded unit sequences, yielding low bitrates.
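
The idea of combining acoustic similarity with temporal proximity can be illustrated with a small sketch. The abstract does not state how the SOM is temporally biased, so everything below is an assumption made for illustration: a 1-D codebook stands in for a trained SOM, and the hypothetical parameter `temporal_weight` penalizes units whose index is far from the winner of the previous frame. The helper names `decode_units` and `count_switches` are likewise hypothetical and are not taken from the paper.

```python
import numpy as np


def decode_units(frames, codebook, temporal_weight=0.0):
    """Assign each frame to one unit of a 1-D codebook.

    cost(unit) = acoustic distance
                 + temporal_weight * |unit index - previous winner index|

    The temporal term is an illustrative assumption, not the paper's formula.
    With temporal_weight=0 this reduces to plain nearest-unit assignment.
    """
    unit_ids = np.arange(len(codebook))
    units = []
    prev = None
    for x in frames:
        # Euclidean distance from this frame to every unit vector.
        cost = np.linalg.norm(codebook - x, axis=1)
        if prev is not None:
            # Favor units close (in index space) to the previous winner.
            cost = cost + temporal_weight * np.abs(unit_ids - prev)
        prev = int(np.argmin(cost))
        units.append(prev)
    return np.array(units)


def count_switches(units):
    """Number of positions where the decoded unit changes."""
    return int(np.sum(units[1:] != units[:-1]))


# Toy data: three units on a line, frames hovering between units 0 and 1.
codebook = np.array([[0.0], [1.0], [2.0]])
frames = np.array([[0.1], [0.6], [0.4], [0.55], [0.1]])

plain = decode_units(frames, codebook, temporal_weight=0.0)
biased = decode_units(frames, codebook, temporal_weight=0.5)
print(plain, count_switches(plain))
print(biased, count_switches(biased))
```

On this toy input the unbiased decoding alternates between units 0 and 1, while the biased decoding stays on unit 0. This is the kind of continuity the abstract associates with lower bitrates, although the actual bias and post-filtering used by the authors may differ.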
