Unspeech: Unsupervised Speech Context Embeddings

Benjamin Milde; Chris Biemann

INTERSPEECH 2018

Abstract

We introduce "Unspeech" embeddings, which are based on unsupervised learning of context feature representations for spoken language. The embeddings were trained on up to 9500 hours of crawled English speech data without transcriptions or speaker information, using a straightforward learning objective based on context and non-context discrimination with negative sampling. We use a Siamese convolutional neural network architecture to train Unspeech embeddings and evaluate them on speaker comparison, on utterance clustering, and as a context feature in TDNN-HMM acoustic models trained on TED-LIUM, comparing them to i-vector baselines. In particular, decoding out-of-domain speech data from the recently released Common Voice corpus shows consistent WER reductions. We release our source code and pre-trained Unspeech models under a permissive open source license.
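The context versus non-context discrimination objective mentioned in the abstract can be illustrated with a small sketch. This is not the authors' released code: it assumes the two branches of the Siamese network have already produced fixed-size embedding vectors, that pairs are scored with a dot product, and that the loss is the usual logistic negative-sampling loss. The function names `log_sigmoid` and `negative_sampling_loss` and all array shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)


def negative_sampling_loss(target, context, negatives):
    """Logistic loss for one true context pair and k sampled non-context windows.

    Assumed shapes (illustrative only): target (d,), context (d,), negatives (k, d).
    """
    pos_score = target @ context      # similarity to the true context window
    neg_scores = negatives @ target   # similarities to the k sampled windows
    # Push the true pair towards label 1 and the sampled pairs towards label 0.
    return -(log_sigmoid(pos_score) + log_sigmoid(-neg_scores).sum())


rng = np.random.default_rng(0)
d, k = 8, 4
target = rng.normal(size=d)

# Context agrees with the target, negatives point the opposite way: low loss.
aligned_loss = negative_sampling_loss(target, target, -np.tile(target, (k, 1)))
# Context disagrees with the target, negatives agree with it: high loss.
misaligned_loss = negative_sampling_loss(target, -target, np.tile(target, (k, 1)))
```

Under these assumptions, minimising the loss pulls embeddings of neighbouring speech windows together and pushes embeddings of randomly sampled windows apart, which is the general idea behind training without transcriptions or speaker labels.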

Topics

Machine Learning > Core Methods > Embedding Learning
Machine Learning > Learning Types > Unsupervised Learning
Speech & Audio > Recognition > Speech Recognition
Speech & Audio > Analysis > Speech Analysis
Deep Learning > Learning Types > Self-Supervised Learning
Artificial Intelligence > Learning Paradigms > Self-Supervised Learning
Machine Learning > Learning Paradigms > Self-Supervised Learning

Keywords

unsupervised learning, speaker verification, speaker comparison, siamese neural network, siamese network, negative sampling, speech embedding, context representation, siamese convolutional neural network, context feature, utterance clustering
