AAAI 2026

Multi-view Invariance Learning for 3D Scene Graph Pre-training via Collaborative Cross-Modal Regularization

Abstract

3D scene graph generation is a pivotal task in scene understanding, but its performance is easily constrained by the limited availability of annotated data. Existing point cloud pre-training solutions usually emphasize object-centric representations while neglecting predicate feature learning. This limitation significantly hinders their relational reasoning capabilities, as inter-object relationships are fundamentally governed by predicate features. To enhance 3D scene graph pre-training, this paper proposes a task-specific Multi-view Invariance Learning framework with Collaborative Cross-modal Regularization. Specifically, the inherent horizontal-rotation invariance of 3D objects and their semantic relationships is leveraged to construct a self-supervised paradigm for triplet feature learning. Moreover, the framework harnesses cross-modal prior knowledge from a vision-language model to regularize model optimization, and further achieves semantic discrimination via unsupervised deep clustering. To resolve the knowledge discrepancies that arise when the pre-trained model is fine-tuned, a predicate adapter equipped with a knowledge filtering gate is devised to selectively aggregate the predicate features of the pre-trained model. Extensive experiments demonstrate that the framework is effective in boosting 3D scene graph generation performance, surpassing state-of-the-art methods.
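As a rough illustration of the horizontal-rotation invariance the abstract relies on, the sketch below rotates two synthetic objects about the vertical axis and checks that a simple pairwise descriptor is unchanged between the two views. This is not the paper's method: the function names `rotate_horizontal` and `pair_descriptor`, the choice of z as the vertical axis, the synthetic point clouds, and the descriptor itself are all assumptions made only for this example.

```python
import numpy as np


def rotate_horizontal(points, angle_rad):
    """Rotate an (N, 3) point cloud about the vertical axis (assumed to be z)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s, 0.0],
                    [s, c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T


def pair_descriptor(subject, obj):
    """Hypothetical stand-in for a predicate feature: the horizontal distance
    and the vertical offset between the two object centroids."""
    delta = obj.mean(axis=0) - subject.mean(axis=0)
    return np.array([np.linalg.norm(delta[:2]), delta[2]])


# Two synthetic objects (random points, for illustration only).
rng = np.random.default_rng(0)
chair = rng.normal(size=(128, 3))
table = rng.normal(size=(128, 3)) + np.array([1.5, 0.0, 0.3])

# Second view: the same scene rotated by a random horizontal angle.
angle = rng.uniform(0.0, 2.0 * np.pi)
view_a = pair_descriptor(chair, table)
view_b = pair_descriptor(rotate_horizontal(chair, angle),
                         rotate_horizontal(table, angle))

# The descriptor agrees across views up to floating-point error.
invariant = np.allclose(view_a, view_b)
```

A learned encoder does not get this property for free, which is presumably why a self-supervised objective would encourage features from rotated views of the same triplet to agree.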
