Multi-Level Cross-Modal Alignment for Image Clustering

Liping Qiu; Qin Zhang; Xiaojun Chen; Shaotian Cai

AAAI 2024

Abstract

Recently, cross-modal pretraining models have been employed to produce meaningful pseudo-labels to supervise the training of image clustering models. However, numerous erroneous alignments in a cross-modal pretraining model can produce poor-quality pseudo-labels and degrade clustering performance. To solve this issue, we propose a novel Multi-level Cross-modal Alignment method that improves the alignments in a cross-modal pretraining model for downstream tasks by building a smaller but better semantic space and aligning the images and texts at three levels: instance level, prototype level, and semantic level. Theoretical results show that our proposed method converges and suggest effective means to reduce its expected clustering risk. Experimental results on five benchmark datasets clearly show the superiority of our new method.
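As a rough illustration of the pseudo-labeling idea the abstract starts from, the sketch below assigns each image to its most similar text embedding by cosine similarity. It is not the authors' method: the function names, the temperature value, and the toy embeddings are all assumptions standing in for the outputs of a real cross-modal encoder, and none of the three alignment levels is implemented here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def pseudo_labels(image_emb, text_emb, temperature=0.01):
    """Assign each image to its most similar text embedding.

    image_emb: (n_images, d) and text_emb: (n_texts, d) are hypothetical
    stand-ins for the outputs of a cross-modal encoder.
    Returns hard labels (n_images,) and soft assignments (n_images, n_texts).
    """
    sim = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    logits = sim / temperature                      # assumed temperature
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)       # row-wise softmax
    return probs.argmax(axis=1), probs

# Toy data: three "text" directions and two noisy "images" near each one.
rng = np.random.default_rng(0)
text_emb = np.eye(3, 8)
image_emb = np.repeat(text_emb, 2, axis=0) + 0.05 * rng.normal(size=(6, 8))

labels, probs = pseudo_labels(image_emb, text_emb)
print(labels)
```

In this toy setting every image lands on the text it was generated from. With real embeddings some of these nearest-text assignments would be wrong, which is the kind of erroneous alignment the abstract says degrades clustering.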

Topics

Machine Learning > Core Methods > Clustering
Deep Learning > Architectures > Transformers
Machine Learning > Learning Types > Multi-Task Learning
Machine Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Self-Supervised Learning

Keywords

representation learning, multi-modal learning, semantic space, image clustering, cross-modal alignment, pseudo label
