2016 INTERSPEECH INTERSPEECH 2016

Jointly Learning to Locate and Classify Words Using Convolutional Networks

Abstract

In this paper, we propose a novel approach for weakly-supervised word recognition. Most state of the art automatic speech recognition systems are based on frame-level labels obtained through forced alignments or through a sequential loss. Recently, weakly-supervised trained models have been proposed in vision, that can learn which part of the input is relevant for classifying a given pattern [1]. Our system is composed of a convolutional neural network and a temporal score aggregation mechanism. For each sentence, it is trained using as supervision only some of the words (most frequent) that are present in a given sentence, without knowing their order nor quantity. We show that our proposed system is able to jointly classify and localize words. We also evaluate the system on a keyword spotting task, and show that it can yield similar performance to strong supervised HMM/GMM baseline.

πŸš€ Conference Pioneer β€” INTERSPEECH 2016
πŸŒ‰ Interdisciplinary Bridge β€” Deep Learning and Machine Learning and Speech & Audio
🧭 Keyword Pioneer β€” word recognition
🐣 Hot Topic Early Bird β€” convolutional neural network
🐝 Cross-Pollinator β€” Artificial Intelligence, Computer Science, Computer Vision, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio