Multimodal Gaussian Mixture Variational Autoencoder with Consistency Regularizations

Yarui Chen; Lehan Hong; Jianlin Shao; Jianning Yang; Tingting Zhao; Yun Liao; Yancui Shi

2026 AAAI AAAI 2026

Multimodal Gaussian Mixture Variational Autoencoder with Consistency Regularizations

Abstract

Abstract Variational autoencoder (VAE)-based frameworks possess a natural advantage in modeling the shared and private information inherent in multimodal data. However, current models focus on improving the quality of shared representations from the reconstruction perspective, lacking explicit mechanisms to model their underlying semantic structure. In this paper, we propose the multimodal Gaussian mixture variational autoencoder with consistency regularizations, which introduces a Gaussian mixture prior over the shared latent space to enhance its semantic structure and encourage the formation of cluster-aware latent representations. To address the cross-modal inconsistency problem under missing modality conditions, we propose a cluster-guided regularization strategy that enforces the cross-modal consistency using the pseudo-category labels from unsupervised clustering. Additionally, we design a self-supervised contrastive regularization strategy to align semantically similar representations across modalities. Extensive experiments on MNIST-SVHN and MNIST-CDCB datasets demonstrate that our method significantly outperforms prior state-of-the-art models in generation, classification, and retrieval tasks.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yarui Chen , Lehan Hong , Jianlin Shao , Jianning Yang , Tingting Zhao , Yun Liao , Yancui Shi

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Clustering Deep Learning > Models > Variational Inference

Keywords

multimodal learning gaussian mixture model variational autoencoder

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026