Temporal coding with magnitude-phase regularization for sound event detection

Sangwook Park; Sandeep Reddy Kothinti; Mounya Elhilali

INTERSPEECH 2022

Abstract

Sound Event Detection (SED) is the task of identifying sound events by both their temporal boundaries and their sound category. With recent advances in deep learning, more effective SED techniques have been investigated through the annual Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. Most SED systems rely on data-driven learning in which a deep neural network is trained to minimize the error between the model prediction and the ground truth. While this framework is generally effective at identifying the sound classes present in an audio recording, it yields unreliable estimates of the temporal information needed to locate sound boundaries. To improve temporal precision, this paper proposes a novel temporal coding of magnitude and phase for the embedding vectors in an intermediate layer. This coding enters the training objective as a regularization term. The regularization encourages the magnitude of the embedding vectors to increase near event boundaries, that is, the onset and offset points. At the same time, each boundary is distinguishable from the others through the phase difference between two neighboring vectors. This approach yields a notable improvement in timing sensitivity over a baseline system on the SED task of the DCASE 2021 challenge.
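The abstract does not give the regularization term itself, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' formulation. It assumes each frame's embedding can be treated as a single complex number, that the target magnitude is 1 at boundary frames and 0 elsewhere, and that an onset corresponds to a phase step of +π/2 and an offset to a step of -π/2 relative to the preceding frame. The names `boundary_targets`, `magnitude_phase_penalty`, and `phase_weight` are hypothetical.

```python
import cmath
import math


def boundary_targets(labels):
    """Map 0/1 frame labels to (magnitude_target, phase_step_target) pairs.

    Assumed convention (not from the paper): a 0 -> 1 switch marks an onset,
    a 1 -> 0 switch marks an offset, and the clip starts in silence.
    """
    targets = []
    prev = 0
    for cur in labels:
        if cur == 1 and prev == 0:        # onset frame
            targets.append((1.0, math.pi / 2))
        elif cur == 0 and prev == 1:      # offset frame
            targets.append((1.0, -math.pi / 2))
        else:                             # no boundary at this frame
            targets.append((0.0, 0.0))
        prev = cur
    return targets


def magnitude_phase_penalty(embeddings, labels, phase_weight=1.0):
    """Illustrative penalty: magnitude error plus phase-step error at boundaries.

    embeddings: one complex number per frame (an assumed simplification).
    The phase term is only evaluated at boundary frames, and it is only
    meaningful when the preceding frame has non-zero magnitude.
    """
    if len(embeddings) != len(labels):
        raise ValueError("embeddings and labels must have the same length")
    if not labels:
        return 0.0

    targets = boundary_targets(labels)
    mag_loss = 0.0
    phase_loss = 0.0
    prev = None
    for z, (mag_target, step_target) in zip(embeddings, targets):
        # Magnitude should be large at boundaries and small elsewhere.
        mag_loss += (abs(z) - mag_target) ** 2
        if prev is not None and mag_target > 0.0:
            # Phase difference between this frame and its left neighbor.
            step = cmath.phase(z * prev.conjugate())
            # 1 - cos(.) is zero when the step matches the target exactly.
            phase_loss += 1.0 - math.cos(step - step_target)
        prev = z
    return (mag_loss + phase_weight * phase_loss) / len(labels)
```

In a training setup such a term would be added to the usual classification loss with some weight. With the frame labels `[0, 0, 1, 1, 0]`, a sequence whose magnitude peaks at frames 2 and 4 with the assumed phase steps receives a smaller penalty than one with the two phase steps swapped.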

Topics

Machine Learning > Core Methods > Representation Learning
Deep Learning > Models > Generative Models
Deep Learning > Techniques > Normalization
Speech & Audio > Analysis > Speech Analysis

Keywords

temporal coding; deep neural network; sound event detection; embedding vector; event boundary; phase difference; magnitude-phase regularization
