2023 INTERSPEECH INTERSPEECH 2023

Cross-Modal Semantic Alignment before Fusion for Two-Pass End-to-End Spoken Language Understanding

Abstract

The deliberation-based two-pass model that combines both semantic and acoustic information can effectively improve the performance of end-to-end (E2E) spoken language understanding (SLU). However, existing two-pass models usually simply fuse speech embedding and text embedding without taking into account the inherent distinctions between these two modalities. We propose a novel approach named Cross-modal Semantic Alignment before Fusion (CSAF), which adopt contrastive loss aligning speech and text embeddings before fusing them. We introduce a shared semantic memory transformer to project the embeddings from two modalities into a common semantic space, and a multi-modal gated network to generate the fused embeddings. We conduct experiments on the FSC Challenge test set and SLURP dataset. The results demonstrate that our method can significantly promote intent classification accuracy, achieving an absolute improvement of 3.1% over previous works in the FSC Challenge Utterance Set.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning
🧭 Keyword Pioneer — semantic memory transformer
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio