Leveraging Pretrained Representations for Cross-Modal Point Cloud Completion

Kshitij Kale; Hrishikesh U; V sreenidhe; Shylaja S S

2026 WACV WACV 2026

Leveraging Pretrained Representations for Cross-Modal Point Cloud Completion

Abstract

The utility of 3D point clouds in critical applications like robotics is often hindered by their inherent incompleteness, a result of real-world occlusions and limited sensor viewpoints. To overcome this, image-guided 3D point cloud completion aims to reconstruct complete shapes by leveraging a corresponding 2D image. However, current methods typically train a cross-modal network from scratch, often failing to capture the high-level semantic context and complex structural information required for robust reconstruction. This paper challenges that paradigm by demonstrating that preexisting knowledge from large-scale, pretrained vision models can be effectively leveraged to guide the completion process. We introduce a novel Dual Branch Image Encoder, a dedicated module designed to extract and fuse rich semantic priors from a pretrained Vision Transformer with geometric depth cues. This fused representation provides a powerful, multifaceted guide that is integrated into EGIInet, a state-of-the-art point cloud completion network. Our experiments show that by conditioning the completion on these strong, pretrained priors, our method outperforms existing state-of-the-art techniques by 7% without changing the rest of the architecture, producing more semantically coherent and structurally accurate 3D shapes.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio