Cascaded Multilingual Audio-Visual Learning from Videos

Andrew Rouditchenko; Angie Boggust; David Harwath; Samuel Thomas; Hilde Kuehne; Brian Chen; Rameswar Panda; Rogerio Feris; Brian Kingsbury; Michael Picheny; James Glass

INTERSPEECH 2021

Abstract

In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were trained and evaluated only on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos. With our cascaded approach, we show an improvement in retrieval performance of nearly 10× compared to training solely on the Japanese videos. We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Application Areas > Domain Adaptation
Computer Vision > Processing > Video Understanding

Keywords

self-supervised learning, cross-lingual transfer, audio-visual learning, video retrieval, instructional video, multilingual learning, multilingual representation, cross-lingual retrieval, cascaded approach
