PILOT: Introducing Transformers for Probabilistic Sound Event Localization

Christopher Schymura; Benedikt Bönninghoff; Tsubasa Ochiai; Marc Delcroix; Keisuke Kinoshita; Tomohiro Nakatani; Shoko Araki; Dorothea Kolossa

INTERSPEECH 2021

Abstract

Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g., a microphone array). Recent advances in this domain have focused most prominently on deep recurrent neural networks. Inspired by the success of transformer architectures as an alternative to classical recurrent neural networks, this paper introduces a novel transformer-based sound event localization framework in which temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms. Additionally, the estimated sound event positions are represented as multivariate Gaussian variables, which yields a notion of uncertainty that many previously proposed deep learning-based systems for this application do not provide. The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy. It outperforms all competing systems on all datasets, with statistically significant differences in performance.
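To illustrate what representing a position estimate as a multivariate Gaussian variable can mean in practice, the following is a minimal sketch, not the authors' implementation. It assumes a single source whose position is a two-dimensional direction (azimuth, elevation) in radians, with a predicted mean and a 2x2 covariance matrix; the abstract does not specify the parameterization, and the function name `gaussian_nll_2d` is hypothetical. The sketch evaluates the negative log-likelihood of a reference position under the predicted Gaussian and ignores angle wrap-around.

```python
import math


def gaussian_nll_2d(target, mean, cov):
    """Negative log-likelihood of a 2-D point under N(mean, cov).

    Assumption (not from the paper): positions are (azimuth, elevation)
    pairs in radians and cov is a symmetric 2x2 matrix given as nested
    tuples or lists. Angle wrap-around is ignored in this sketch.
    """
    (a, b), (c, d) = cov
    if abs(b - c) > 1e-12:
        raise ValueError("covariance must be symmetric")
    det = a * d - b * c
    if a <= 0.0 or det <= 0.0:
        raise ValueError("covariance must be positive definite")

    dx = target[0] - mean[0]
    dy = target[1] - mean[1]

    # Squared Mahalanobis distance, using the closed-form inverse
    # of a 2x2 matrix: inv(cov) = [[d, -b], [-c, a]] / det.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

    # NLL = 0.5 * maha + 0.5 * log(det) + (k / 2) * log(2 * pi), with k = 2.
    return 0.5 * (maha + math.log(det)) + math.log(2.0 * math.pi)


# The same localization error scored under a confident and an uncertain estimate.
reference = (1.0, 0.0)
estimate = (0.0, 0.0)
confident = gaussian_nll_2d(reference, estimate, ((0.01, 0.0), (0.0, 0.01)))
uncertain = gaussian_nll_2d(reference, estimate, ((1.0, 0.0), (0.0, 1.0)))
print(confident > uncertain)
```

With the same localization error, the estimate with the small covariance receives a larger negative log-likelihood than the one with the broad covariance. This is the sense in which a Gaussian output carries uncertainty: a confident but wrong estimate is penalized more than a hedged one.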

Topics

Artificial Intelligence > Core AI > Trajectory Prediction

Deep Learning > Architectures > Transformers

Speech & Audio > Analysis > Speech Analysis

Machine Learning > Learning Types > Uncertainty Quantification

Keywords

transformer architecture; self-attention mechanism; probabilistic modeling; multivariate Gaussian; uncertainty quantification; sound event localization
