2022 INTERSPEECH INTERSPEECH 2022

Temporal coding with magnitude-phase regularization for sound event detection

Abstract

Sound Event Detection (SED) is the challenge of identifying sound events into their temporal boundaries as well as sound category. With recent advances in deep learning, more effective SED techniques are investigated through the annual challenge of Detection and Classification of Acoustic Scenes and Events (DCASE). Most SED systems rely on data-driven learning where a deep neural network is trained to minimize the error between model prediction and the truth. While this framework is generally effective at identifying sound classes present in an audio recording, it results in unreliable estimates of temporal information for identifying sound boundaries. In order to heighten the temporal precision, this paper proposes a novel temporal coding of magnitude and phase for embedding vectors in an intermediate layer. This coding is reflected as a regularization term in the objective function for training the model. The regularization allows magnitude of embedding vectors to increase near event boundaries, which represent the onset and offset points. Simultaneously, each of the boundaries are distinguishable from others using phase difference between two neighboring vectors. This approach results in notable improvement in timing sensitivity compared to a baseline system tested on SED task in the context of DCASE2021 challenge.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Speech & Audio
🧭 Keyword Pioneer — event boundary
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio