INTERSPEECH 2022

WA-Transformer: Window Attention-based Transformer with Two-stage Strategy for Multi-task Audio Source Separation

Abstract

The standard Conformer adopts convolution layers to exploit local features. However, one-dimensional convolution ignores the correlation between adjacent time-frequency features. In this paper, we design a two-dimensional window attention block with dilation and propose a window attention-based Transformer network (named WA-Transformer) for multi-task audio source separation. The proposed WA-Transformer adopts self-attention and window attention blocks to model global dependencies and local correlation in a parameter-efficient way. In addition, it follows a two-stage pipeline, in which the first stage separates the mixture and outputs three types of audio signals (speech, music, and noise), and the second stage performs signal compensation. Experiments demonstrate the effectiveness of WA-Transformer: it achieves signal-to-distortion ratio improvements of 13.86 dB, 12.22 dB, and 11.21 dB on the speech, music, and noise tracks, respectively, and shows advantages over several well-known models.
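For readers unfamiliar with the reported metric, the sketch below illustrates how a signal-to-distortion ratio improvement (SDRi) is commonly computed: the SDR of the separated track minus the SDR of the unprocessed mixture, both measured against the clean reference. It assumes the basic energy-ratio definition of SDR; the exact evaluation variant used for the numbers above is not stated in the abstract, and the function names and toy signals are hypothetical.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """SDR in dB, assuming the plain definition 10*log10(||s||^2 / ||s - s_hat||^2)."""
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps guards against log(0) and division by zero for silent or perfect signals.
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))


def sdr_improvement_db(reference, estimate, mixture):
    """SDRi: gain of the separated estimate over using the raw mixture as the estimate."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals (illustrative only, not data from the paper).
speech = [math.sin(0.1 * n) for n in range(1000)]
interference = [0.5 * math.cos(0.37 * n) for n in range(1000)]
mixture = [s + i for s, i in zip(speech, interference)]

# A hypothetical separator that leaves 10% of the interference amplitude behind.
estimate = [s + 0.1 * i for s, i in zip(speech, interference)]

sdri = sdr_improvement_db(speech, estimate, mixture)
print(f"SDRi: {sdri:.2f} dB")
```

In this toy case the residual interference amplitude is reduced by a factor of 10, so its energy falls by a factor of 100 and the improvement is about 20 dB.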
