INTERSPEECH 2024

GPA: Global and Prototype Alignment for Audio-Text Retrieval

Abstract

Recent Audio-Text Retrieval (ATR) models have made steady progress by pursuing semantic interaction between audio and text pairs. To go beyond this coarse-grained global interaction, we must tackle the harder problem of fine-grained cross-modal learning between audio and text. In this paper, we present GPA for ATR, which achieves both Global (coarse-grained) and Prototype (fine-grained) Alignment. Specifically, in addition to performing a vanilla global contrast between audio and text pairs, we model the frames in audio and the words in text as prototypes and align them to generate a prototype similarity matrix. Based on this matrix, we introduce a Learnable Attention Similarity Scoring module, which takes into account the information across different prototype pairs to obtain the retrieval score. Finally, we apply the Sinkhorn-Knopp algorithm to refine the retrieval score. Experimental results on two benchmark datasets show superior performance, demonstrating the efficacy of the proposed GPA.
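The pipeline described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names, the use of cosine similarity between frame and word embeddings, the softmax-weighted pooling (a fixed-temperature stand-in for the learnable attention scoring module), and the `temperature` and `epsilon` parameters are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def prototype_similarity(frames, words):
    """Frame-by-word similarity matrix (assumed here to be cosine similarity)."""
    return [[cosine(f, w) for w in words] for f in frames]


def softmax(values, temperature):
    """Numerically stable softmax with a temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attention_score(sim, temperature=0.1):
    """Pool a prototype similarity matrix into one retrieval score.

    Hypothetical stand-in for a learnable attention module: every
    frame-word pair is weighted by a softmax over its own similarity.
    In a trained model the weights would come from learned parameters.
    """
    flat = [s for row in sim for s in row]
    weights = softmax(flat, temperature)
    return sum(w * s for w, s in zip(weights, flat))


def sinkhorn_knopp(scores, n_iters=50, epsilon=1.0):
    """Rescale a score matrix by alternating row and column normalization.

    Scores are exponentiated to obtain a strictly positive matrix, then
    rows and columns are normalized in turn (Sinkhorn-Knopp iteration).
    """
    kernel = [[math.exp(s / epsilon) for s in row] for row in scores]
    n_cols = len(kernel[0])
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        kernel = [[x / sum(row) for x in row] for row in kernel]
        # Column step: make every column sum to 1.
        col_sums = [sum(row[j] for row in kernel) for j in range(n_cols)]
        kernel = [[row[j] / col_sums[j] for j in range(n_cols)] for row in kernel]
    return kernel
```

In this sketch, `attention_score(prototype_similarity(frames, words))` would be computed for every audio-text pair in a batch, and `sinkhorn_knopp` would then be applied to the resulting batch-level score matrix so that no single audio clip or caption dominates the ranking. A smaller `epsilon` sharpens the rescaled scores but needs more iterations to converge.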
