Unsupervised Deep Auditory Model Using Stack of Convolutional RBMs for Speech Recognition

Hardik B. Sailor; Hemant A. Patil

INTERSPEECH 2016

Abstract

We recently proposed an unsupervised filterbank learning model based on the Convolutional RBM (ConvRBM). This model learns auditory-like subband filters directly from speech signals. In this paper, we propose a two-layer Unsupervised Deep Auditory Model (UDAM) built by stacking two ConvRBMs. The first-layer ConvRBM learns a filterbank from speech signals and hence represents early auditory processing. The responses of its hidden units are pooled into a short-time spectral representation, which is used to train a second ConvRBM with a greedy layer-wise method. The second-layer ConvRBM, trained on this spectral representation, learns Temporal Receptive Fields (TRFs), which represent temporal properties of the auditory cortex in the human brain. To show the effectiveness of the proposed UDAM, speech recognition experiments were conducted on the TIMIT and AURORA 4 databases. We show that second-layer features, when added to the first-layer filterbank features, perform better than the first-layer features alone (with or without their delta features). For both databases, the proposed two-layer deep auditory features improve speech recognition performance over Mel filterbank features. Further improvements can be achieved by system-level combination of UDAM features and Mel filterbank features.
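The two-layer feature pipeline described in the abstract can be illustrated with a minimal NumPy sketch of the forward pass only. It is not the authors' implementation: the filters are random stand-ins for weights that would be learned by ConvRBM training, and the rectified-linear activation, average pooling, log compression, window and hop sizes, and concatenation of the two layers' outputs are all assumptions made for illustration. The function names (conv_layer, pool, udam_features) are hypothetical.

```python
import numpy as np

def conv_layer(x, filters):
    """Valid 1-D correlation along time, summed over input channels, then rectified.

    x:       (channels, time)
    filters: (num_filters, channels, width)
    The rectified-linear activation is an assumption of this sketch.
    """
    num_filters, channels, width = filters.shape
    out = np.zeros((num_filters, x.shape[1] - width + 1))
    for k in range(num_filters):
        for c in range(channels):
            out[k] += np.correlate(x[c], filters[k, c], mode="valid")
    return np.maximum(out, 0.0)

def pool(h, win, hop):
    """Average each row over windows of `win` samples taken every `hop` samples."""
    n_frames = 1 + (h.shape[1] - win) // hop
    frames = [h[:, i * hop:i * hop + win].mean(axis=1) for i in range(n_frames)]
    return np.stack(frames, axis=1)

def udam_features(signal, w1, w2, win=400, hop=160):
    """Hypothetical two-layer forward pass (win/hop values are assumptions)."""
    # Layer 1: subband filtering of the raw waveform.
    h1 = conv_layer(signal[None, :], w1)
    # Pool hidden responses into a short-time spectral representation.
    # Log compression is an assumption, not stated in the abstract.
    spec = np.log(pool(h1, win, hop) + 1e-6)
    # Layer 2: temporal filters applied across frames of the layer-1 output.
    h2 = conv_layer(spec, w2)
    # Append layer-2 features to the (cropped) layer-1 filterbank features.
    n = h2.shape[1]
    return np.concatenate([spec[:, :n], h2], axis=0)

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)          # stand-in for 1 s of speech at 16 kHz
w1 = rng.standard_normal((8, 1, 128))        # random stand-ins for learned layer-1 filters
w2 = rng.standard_normal((4, 8, 5))          # random stand-ins for learned layer-2 filters
feats = udam_features(signal, w1, w2)
print(feats.shape)                           # (features, frames)
```

In the greedy layer-wise scheme the abstract describes, the first-layer weights would be trained and fixed before the pooled representation is used to train the second layer; the sketch only shows how the two trained layers would be chained at feature-extraction time.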
