IJCAI 2024

Generating More Audios for End-to-End Spoken Language Understanding

Abstract

End-to-end spoken language understanding (SLU) aims to capture the full semantics of a spoken utterance directly, without generating a transcript. Because transcripts are not always available, textless SLU, which eliminates the need for transcripts, is attracting increasing attention, but it often underperforms SLU models trained with transcripts. In this paper, we focus on scenarios where transcripts are unavailable and propose GMA-SLU, a framework that generates additional audio according to the labels. To alleviate the modality gap between text and audio, we develop two language models and use discrete tokens as a bridge: the first language model generates semantic tokens from the labels, and the second takes these semantic tokens together with the acoustic tokens of source audio to generate the synthetic audio. All experiments are conducted on the monolingual SLU dataset SLURP and the multilingual SLU dataset MINDS-14. Experimental results show that our method outperforms the previous best textless end-to-end SLU models and achieves performance comparable to models trained with the assistance of the corresponding transcripts.
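
The two-stage data flow described in the abstract can be illustrated with a minimal Python sketch. This is not the GMA-SLU implementation: the abstract does not specify the model architectures, tokenizers, or vocabulary sizes, so every name here (`label_to_semantic_tokens`, `semantic_to_acoustic_tokens`, `generate_more_audios`, `SEMANTIC_VOCAB`, `ACOUSTIC_VOCAB`) is hypothetical, and seeded random generators stand in for the two trained language models. Decoding acoustic tokens back to a waveform is omitted.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical vocabulary sizes; the abstract does not give real values.
SEMANTIC_VOCAB = 100
ACOUSTIC_VOCAB = 1024


def label_to_semantic_tokens(label: str, length: int = 8) -> List[int]:
    """Stand-in for the first language model: label -> semantic tokens.

    A seeded random generator replaces the trained model, so the same
    label always yields the same toy token sequence.
    """
    rng = random.Random(label)
    return [rng.randrange(SEMANTIC_VOCAB) for _ in range(length)]


def semantic_to_acoustic_tokens(
    semantic: Sequence[int],
    source_acoustic: Sequence[int],
    frames_per_token: int = 2,
) -> List[int]:
    """Stand-in for the second language model.

    Conditions on the semantic tokens and on the acoustic tokens of a
    source audio clip, and returns acoustic tokens for a synthetic clip.
    """
    seed = repr((list(semantic), list(source_acoustic)))
    rng = random.Random(seed)
    n_frames = len(semantic) * frames_per_token
    return [rng.randrange(ACOUSTIC_VOCAB) for _ in range(n_frames)]


def generate_more_audios(
    label: str, source_acoustics: Sequence[Sequence[int]]
) -> List[Tuple[List[int], str]]:
    """Produce one synthetic (acoustic_tokens, label) pair per source clip."""
    semantic = label_to_semantic_tokens(label)
    return [
        (semantic_to_acoustic_tokens(semantic, src), label)
        for src in source_acoustics
    ]


if __name__ == "__main__":
    sources = [[1, 2, 3], [4, 5, 6]]  # toy acoustic tokens of two source clips
    for tokens, label in generate_more_audios("alarm_set", sources):
        print(label, len(tokens))
```

In this sketch each source clip acts as an acoustic prompt, so one label can yield several labeled synthetic examples that could then be added to the SLU training set.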
