Can Large Language Models Understand Spatial Audio?

Changli Tang; Wenyi Yu; Guangzhi Sun; Xianzhao Chen; Tian Tan; Wei Li; Jun Zhang; Lu Lu; Zejun Ma; Yuxuan Wang; Chao Zhang

INTERSPEECH 2024

Abstract

This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance understanding of 3D environments via audio. We study three spatial audio tasks: sound source localization (SSL), far-field speech recognition (FSR), and localization-informed speech extraction (LSE), achieving notable progress in each task. For SSL, our approach achieves a mean absolute error (MAE) of 2.70° on the Spatial LibriSpeech dataset, substantially surpassing the prior benchmark of about 6.60°. Moreover, our model can employ spatial cues to improve FSR accuracy and execute LSE by selectively attending to sounds originating from a direction specified via text prompts, even amidst overlapping speech. These findings highlight the potential of adapting LLMs to grasp physical audio concepts, paving the way for LLM-based agents in 3D environments.
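The SSL result above is reported as an MAE in degrees. The abstract does not say how that error is computed, so the following is only a minimal sketch under an assumption: each prediction is a single azimuth angle in degrees, and the error wraps around at 360° so that 359° and 1° count as 2° apart. The function names `angular_error_deg` and `mean_angular_error_deg` are hypothetical and are not taken from the paper.

```python
def angular_error_deg(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    # Assumption: angles are azimuths in degrees; wrap the difference into [0, 360).
    diff = (pred_deg - true_deg) % 360.0
    # The shorter way around the circle is the error.
    return min(diff, 360.0 - diff)


def mean_angular_error_deg(preds, targets):
    """Mean of the wrapped absolute angular errors over paired predictions and targets."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("preds and targets must be non-empty and of equal length")
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Without the wraparound step, a prediction of 359° against a target of 1° would be scored as a 358° error, which would distort any average taken over a full circle of source directions.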

Topics

Natural Language Processing > Resources & Methods > Large Language Models
Speech & Audio > Recognition > Speech Recognition
Speech & Audio > Analysis > Speaker Verification

Keywords

sound source localization; far-field speech recognition; multi-channel audio; spatial audio; speech extraction; large language model
