Data-Efficient Multimodal Fusion on a Single GPU

Noël Vouitsis; Zhaoyan Liu; Satya Krishna Gorti; Valentin Villecroze; Jesse C. Cresswell; Guangwei Yu; Gabriel Loaiza-Ganem; Maksims Volkovs

CVPR 2024

Abstract

The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources, making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower cost. We therefore propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance, and in certain cases outperform state-of-the-art methods, in both image-text and audio-text retrieval with orders of magnitude less compute and data: for example, we outperform CLIP on the Flickr30K text-to-image retrieval task with ~600× fewer GPU days and ~80× fewer image-text pairs. Additionally, we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones. Code is available at: https://github.com/layer6ai-labs/fusemix.
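The abstract does not spell out how the augmentation works, so the following is only a minimal sketch of one plausible reading: a mixup-style interpolation applied to pre-computed latents from two frozen unimodal encoders, with the same mixing coefficient shared across both modalities so that the mixed pair stays semantically matched. The function name `fusemix_sketch`, the Beta(alpha, alpha) sampling, and the in-batch permutation are all assumptions made for illustration, not details confirmed by the text above; the linked repository is the authoritative reference.

```python
import numpy as np


def fusemix_sketch(z_a, z_b, alpha=1.0, rng=None):
    """Mixup-style augmentation on paired latents (hypothetical helper).

    z_a, z_b: arrays of shape (n, d_a) and (n, d_b) holding pre-computed
    latents from two frozen unimodal encoders; row i of z_a is paired
    with row i of z_b.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = z_a.shape[0]

    # Assumption: one coefficient per sample, drawn from Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha, size=(n, 1))

    # Assumption: mixing partners are chosen by permuting the batch.
    perm = rng.permutation(n)

    # The same lam and the same partner index are used for both
    # modalities, so each mixed pair remains a consistent pair.
    mixed_a = lam * z_a + (1.0 - lam) * z_a[perm]
    mixed_b = lam * z_b + (1.0 - lam) * z_b[perm]
    return mixed_a, mixed_b, lam, perm
```

Under these assumptions, the mixed pairs would be fed to whatever alignment objective is used to train small fusion layers on top of the frozen encoders. Because the encoders are never updated, their latents can be computed once and cached, which is one way the compute savings described in the abstract could arise.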

Topics

Machine Learning > Core Methods > Embedding Learning
Machine Learning > Application Areas > Efficient Computing
Deep Learning > Techniques > Pretraining
Machine Learning > Learning Types > Multimodal Learning
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Techniques > Transfer Learning

Keywords

embedding learning, efficient computing, audio-text retrieval, latent space, image-text retrieval, multimodal fusion, latent space alignment, pre-trained encoder, unimodal encoder
