Tackling Missing Modalities in Audio-Visual Representation Learning Using Masked Autoencoders

Georgios Chochlakis; Chandrashekhar Lavania; Prashant Mathur; Kyu J. Han

INTERSPEECH 2024

Abstract

Audio-visual representations leverage information from both modalities to produce joint representations. Such representations have demonstrated their usefulness in a variety of tasks. However, the modalities a model was trained on may not all be available at inference time. In this work, we study whether and how we can make existing models, trained under pristine conditions, robust to partial modality loss without retraining them. We propose to use a curriculum-trained Masked AutoEncoder to impute features of missing input segments. We show that fine-tuning classification heads with the imputed features makes the base models robust on multiple downstream tasks, such as emotion recognition and Lombard speech recognition. In 10 of the 12 cases evaluated, our method outperforms strong baselines.
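
To make the imputation idea concrete, here is a minimal sketch of the data flow only, not the authors' implementation. It assumes per-segment feature vectors and a boolean flag marking which segments are missing. The names `impute_missing` and `mean_imputer` are hypothetical, and `mean_imputer` is a trivial placeholder standing in for the trained masked autoencoder, whose architecture and training curriculum the abstract does not specify.

```python
import numpy as np

def impute_missing(features, missing, imputer):
    """Replace rows flagged in `missing` with imputer output; keep observed rows."""
    features = np.asarray(features, dtype=float)
    missing = np.asarray(missing, dtype=bool)
    # Zero out missing rows so the imputer never sees stale values.
    masked = np.where(missing[:, None], 0.0, features)
    predicted = imputer(masked, missing)
    # Only missing rows are taken from the imputer; observed rows pass through.
    return np.where(missing[:, None], predicted, features)

def mean_imputer(masked, missing):
    """Placeholder (assumption): predicts the mean of the observed rows.

    A trained masked autoencoder would be plugged in here instead.
    """
    observed = masked[~missing]
    if len(observed):
        mean = observed.mean(axis=0)
    else:
        mean = np.zeros(masked.shape[1])
    return np.broadcast_to(mean, masked.shape)

# Four segments with two feature dimensions; segments 1 and 3 are missing.
feats = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
missing = np.array([False, True, False, True])
imputed = impute_missing(feats, missing, mean_imputer)
```

In this sketch the base model stays untouched: only the imputed feature matrix would be passed on to a classification head for fine-tuning, which mirrors the setup described in the abstract.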

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Architectures > Autoencoders
Machine Learning > Learning Types > Transfer Learning

Keywords

representation learning, curriculum learning, multimodal learning, masked autoencoder, missing modality, feature imputation

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024