INTERSPEECH 2023

Real-Time Personalised Speech Enhancement Transformers with Dynamic Cross-attended Speaker Representations

Abstract

Personalised speech enhancement (PSE) extracts only the speech of a target user and removes everything else from corrupted input audio. This can greatly improve on-device streaming audio processing, such as voice calls and speech recognition, which has strict requirements on model size and latency. To focus the PSE system on the target speaker, it is conditioned on a recording of the user's voice. This recording is usually summarised as a single static vector. However, a static vector cannot reflect all the target user's voice characteristics. Thus, we propose using the full recording. To condition on such a variable-length sequence, we propose fully Transformer-based PSE models with a cross-attention mechanism which generates target speaker representations dynamically. To better reflect the on-device scenario, we carefully design and publish a new PSE dataset. On the dataset, our proposed model significantly surpasses strong baselines while halving the model size and reducing latency.
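
To illustrate how a variable-length enrolment recording can yield a speaker representation that changes from frame to frame, here is a minimal sketch of single-head cross-attention in plain Python. It is not the authors' architecture: the function names, feature sizes, random projection matrices, and the absence of multi-head attention or streaming constraints are all assumptions made for illustration. Each noisy input frame acts as a query over the enrolment frames, so every frame receives its own weighted summary of the enrolment audio rather than one static vector.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(noisy_frames, enrol_frames, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention (illustrative only).

    noisy_frames: T x F features of the corrupted input (queries).
    enrol_frames: L x F features of the enrolment recording (keys, values).
    Returns a T x D per-frame speaker representation and the T x L weights.
    """
    q = matmul(noisy_frames, w_q)                # T x D
    k = matmul(enrol_frames, w_k)                # L x D
    v = matmul(enrol_frames, w_v)                # L x D
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]         # D x L
    scores = matmul(q, k_t)                      # T x L
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rng, rows, cols):
    """Gaussian random matrix standing in for learned weights or features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
feat_dim, attn_dim = 8, 4                        # assumed sizes
w_q = random_matrix(rng, feat_dim, attn_dim)
w_k = random_matrix(rng, feat_dim, attn_dim)
w_v = random_matrix(rng, feat_dim, attn_dim)

noisy = random_matrix(rng, 5, feat_dim)          # 5 input frames
enrol = random_matrix(rng, 12, feat_dim)         # 12 enrolment frames
speaker_repr, attn = cross_attend(noisy, enrol, w_q, w_k, w_v)
```

The same function accepts an enrolment recording of any length, because the softmax normalises over however many enrolment frames are supplied; only the number of columns in the attention weights changes.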
