Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation

Yichi Zhang; Zhuo Chen; Lingbing Guo; Yajing Xu; Binbin Hu; Ziqi Liu; Wen Zhang; Huajun Chen

AAAI 2025

Abstract

Multi-modal knowledge graph completion (MMKGC) aims to discover unobserved knowledge in a given multi-modal knowledge graph (MMKG) by jointly leveraging the structural information in its triples and the multi-modal information of its entities, thereby mitigating the graph's inherent incompleteness. Existing MMKGC methods usually extract multi-modal features with pre-trained models and then integrate them for each entity with fusion modules. This often handles multi-modal entity information coarsely, overlooking fine-grained semantic details and their complex interactions. To address this shortfall, we introduce MyGO, a novel framework that tokenizes, fuses, and augments fine-grained multi-modal entity representations to improve MMKGC performance. Inspired by tokenization techniques, MyGO converts multi-modal entity information into fine-grained discrete tokens and learns entity representations with a cross-modal entity encoder. To further augment these representations, MyGO incorporates fine-grained contrastive learning that highlights the specificity of each entity representation. Experiments on standard MMKGC benchmarks show that our method outperforms 19 recent models.
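
As a rough illustration of the contrastive step described in the abstract, the following is a minimal sketch and not the MyGO implementation. It assumes each entity is already a short list of discrete token ids, uses mean-pooling over a random embedding table as a stand-in for the learned cross-modal entity encoder, and applies a generic in-batch InfoNCE loss. The names `pool_tokens` and `info_nce`, the table size, and the temperature are all hypothetical choices made for this example.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def pool_tokens(token_ids, embedding_table):
    """Mean-pool token embeddings (assumed stand-in for a learned encoder)."""
    dim = len(embedding_table[0])
    pooled = [0.0] * dim
    for t in token_ids:
        for i, x in enumerate(embedding_table[t]):
            pooled[i] += x
    return [x / len(token_ids) for x in pooled]


def info_nce(view_a, view_b, temperature=0.5):
    """Generic in-batch InfoNCE: row i of view_a is the positive for row i of view_b."""
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, bj) / temperature for bj in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


rng = random.Random(0)
# Hypothetical vocabulary of 16 tokens with 8-dimensional embeddings.
table = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
# Hypothetical token ids for three entities.
entity_tokens = [[0, 3, 5], [1, 2, 7], [4, 9, 11]]
reps = [pool_tokens(t, table) for t in entity_tokens]

loss_aligned = info_nce(reps, reps)                    # each entity paired with itself
loss_shuffled = info_nce(reps, reps[1:] + reps[:1])    # each entity paired with another
```

Here both views are the same representations purely to keep the sketch deterministic; in practice the second view would come from some augmentation, which the abstract does not specify. Pairing each entity with itself gives a lower loss than pairing it with a different entity, which is the sense in which such an objective pushes representations to be entity-specific.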

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Contrastive Learning
Knowledge & Reasoning > Representation > Knowledge Graphs
Deep Learning > Models > Transformers
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

representation learning, contrastive learning, multi-modal learning, knowledge graph completion, entity representation, cross-modal encoding, multi-modal knowledge graph completion
