Bridging Language Gaps in Audio-Text Retrieval

Zhiyong Yan; Heinrich Dinkel; Yongqing Wang; Jizhong Liu; Junbo Zhang; Yujun Wang; Bin Wang

INTERSPEECH 2024

Abstract

Audio-text retrieval is a challenging task that requires searching a database for an audio clip or a text caption. Existing research focuses predominantly on English descriptions, which limits the applicability of such models given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose language enhancement (LE), which uses a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder by applying consistent ensemble distillation (CED), improving support for variable-length audio-text retrieval. Our method achieves state-of-the-art (SOTA) performance in English audio-text retrieval on commonly used datasets such as AudioCaps and Clotho. The approach also retrieves content in seven other languages with only 10% of additional language-enhanced training data, yielding promising results.

Topics

Machine Learning > Application Areas > Knowledge Distillation
Natural Language Processing > Applications > Information Retrieval
Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

knowledge distillation, text representation, cross-modal retrieval, audio-text retrieval, language model, cross-lingual retrieval, multilingual encoder
