ATReSN-Net: Capturing Attentive Temporal Relations in Semantic Neighborhood for Acoustic Scene Classification

Liwen Zhang; Jiqing Han; Ziqiang Shi

INTERSPEECH 2020

Abstract

Convolutional Neural Networks (CNNs) have been widely investigated for Acoustic Scene Classification (ASC), where the convolution operation extracts useful semantic content from a local receptive field in the input spectrogram within a certain Manhattan distance, i.e., the kernel size. Although stacking multiple convolution layers can enlarge the receptive field, the enlarged range remains limited to the vicinity of the kernel unless the temporal relations between different receptive fields are considered explicitly. In this paper, we propose a 3D CNN for ASC, named ATReSN-Net, which captures temporal relations between receptive fields at arbitrary time-frequency locations by mapping the semantic features obtained from the residual block into a semantic space. The ATReSN module has two primary components: first, a k-NN-based grouper that gathers a semantic neighborhood for each feature point in the feature maps; second, an attentive pooling-based temporal relations aggregator that generates the temporal relations embedding of each feature point and its neighborhood. Experiments show that ATReSN-Net outperforms most state-of-the-art CNN models. Our code is shared at ATReSN-Net.
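The two components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the relation vector (neighbor feature minus center feature, plus a time offset), the single scoring vector `w`, and all shapes are assumptions chosen only to show the general idea of grouping by k-NN in feature space and then pooling the neighbors with softmax attention.

```python
import numpy as np

def knn_semantic_neighbors(feats, k):
    """Return indices (N, k) of the k nearest rows in feature space, self excluded."""
    sq = np.sum(feats ** 2, axis=1)
    # Pairwise squared Euclidean distances between all feature points.
    dist = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def attentive_pool(feats, times, idx, w):
    """Softmax-weighted sum of per-neighbor relation vectors (hypothetical form)."""
    rel_feat = feats[idx] - feats[:, None, :]             # (N, k, C) feature offsets
    rel_time = (times[idx] - times[:, None])[..., None]   # (N, k, 1) time offsets
    rel = np.concatenate([rel_feat, rel_time], axis=-1)   # (N, k, C + 1)
    scores = rel @ w                                      # (N, k) attention logits
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)         # softmax over neighbors
    return (attn[..., None] * rel).sum(axis=1), attn

# Toy feature map: T time steps, F frequency bins, C channels (assumed sizes).
rng = np.random.default_rng(0)
T, F, C, k = 6, 4, 8, 3
fmap = rng.standard_normal((T, F, C))
feats = fmap.reshape(T * F, C)                    # one row per time-frequency point
times = np.repeat(np.arange(T, dtype=float), F)   # time index of each row
w = rng.standard_normal(C + 1)                    # stand-in for a learned scoring vector

idx = knn_semantic_neighbors(feats, k)
emb, attn = attentive_pool(feats, times, idx, w)
```

Because neighbors are chosen by distance in feature space rather than by position, a point's neighborhood can include time-frequency locations far outside any fixed kernel, which is the property the abstract emphasizes.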

Topics

Machine Learning > Core Methods > Representation Learning
Deep Learning > Architectures > Neural Networks

Keywords

convolutional neural network; 3D convolution; attentive pooling; temporal relation; acoustic scene classification; residual block; semantic neighborhood

Related papers

Memory Controlled Sequential Self Attention for Sound Recognition 2020

Dual Attention in Time and Frequency Domain for Voice Activity Detection 2020

Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer 2020

A Noise Robust Technique for Detecting Vowels in Speech Signals 2020

Joint Detection of Sentence Stress and Phrase Boundary for Prosody 2020