Cross-Aligned Fusion for Multimodal Understanding

Abhishek Rajora; Shubham Gupta; Suman Kundu

WACV 2025

Abstract

Recent multimodal frameworks often grapple with semantic misalignment and noise, which impede the effective integration of diverse modalities. To address this problem, this study presents CaMN (Cross-aligned Multimodal Network), a framework designed to enhance multimodal understanding through a robust cross-alignment mechanism. Unlike conventional fusion methods, our framework aligns features extracted from images, text, and graphs via a tailored loss function, enabling seamless integration and exploitation of complementary information. Leveraging Abstract Meaning Representation (AMR), we extract intricate semantic structures from textual data, enriching the multimodal representation with contextual depth. Furthermore, to enhance robustness, we employ a masked autoencoder to simulate a noise-independent feature space. Through comprehensive evaluation on the CrisisMMD dataset, CaMN demonstrates superior performance in crisis event classification tasks, highlighting its potential in advancing multimodal understanding across diverse domains. Our code is available at https://github.com/brillard1/CaMN.
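
The abstract does not spell out the tailored alignment loss, so the following is only an illustrative sketch of what aligning image, text, and graph features can look like in general. It assumes a symmetric InfoNCE-style contrastive objective averaged over the three modality pairs; that choice, the function names (l2_normalize, info_nce, cross_alignment_loss), and the temperature parameter are all hypothetical and are not taken from the paper or its repository.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(a, b, temperature=0.1):
    """Contrastive loss pulling a[i] toward b[i] and away from b[j], j != i.

    a and b are equal-length lists of embedding vectors, where index i in
    both lists refers to the same sample seen through two modalities.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of sample i to every candidate in the other modality.
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

def cross_alignment_loss(image, text, graph, temperature=0.1):
    """Average symmetric contrastive loss over the three modality pairs.

    Assumption: equal weighting of the pairs; the paper may weight or
    define its loss differently.
    """
    pairs = [(image, text), (image, graph), (text, graph)]
    losses = [
        0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))
        for x, y in pairs
    ]
    return sum(losses) / len(losses)
```

Under these assumptions, a batch whose image, text, and graph embeddings agree sample by sample yields a loss near zero, while shuffling one modality against the others raises it. This is the kind of signal an alignment objective gives the encoders before their features are fused.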

Authors

Abhishek Rajora, Shubham Gupta, Suman Kundu

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Learning Types > Multi-Modal Learning
Natural Language Processing > Resources & Methods > Multimodal NLP
Deep Learning > Learning Types > Multimodal Learning

Keywords

feature extraction, multimodal learning, text representation, semantic alignment, semantic structure, cross-modal alignment, feature fusion, masked autoencoder, graph neural network, crisis classification
