L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream

Jingtao Sun; Yaonan Wang; Mingtao Feng; Yulan Guo; Ajmal Mian; Mike Zheng Shou

2024 CVPR CVPR 2024

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream

Abstract

3D visual language multi-modal modeling plays an important role in actual human-computer interaction. However the inaccessibility of large-scale 3D-language pairs restricts their applicability in real-world scenarios. In this paper we aim to handle a real-time multi-task for 6-DoF pose tracking of unknown objects leveraging 3D-language pre-training scheme from a series of 3D point cloud video streams while simultaneously performing 3D shape reconstruction in current observation. To this end we present a generic Language-to-4D modeling paradigm termed L4D-Track that tackles zero-shot 6-DoF \underline Track ing and shape reconstruction by learning pairwise implicit 3D representation and multi-level multi-modal alignment. Our method constitutes two core parts. 1) Pairwise Implicit 3D Space Representation that establishes spatial-temporal to language coherence descriptions across continuous 3D point cloud video. 2) Language-to-4D Association and Contrastive Alignment enables multi-modality semantic connections between 3D point cloud video and language. Our method trained exclusively on public NOCS-REAL275 dataset achieves promising results on both two publicly benchmarks. This not only shows powerful generalization performance but also proves its remarkable capability in zero-shot inference.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — language-to-4d modeling

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jingtao Sun , Yaonan Wang , Mingtao Feng , Yulan Guo , Ajmal Mian , Mike Zheng Shou

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Zero-Shot Learning Computer Vision > Analysis > 3D Vision Computer Vision > Analysis > Object Tracking Computer Vision > Core AI > Multimodal Learning Deep Learning > Learning Types > Multi-Modal Learning Computer Vision > Processing > 3D Vision

Keywords

point cloud 3d vision language grounding shape reconstruction language model multimodal alignment 3d shape reconstruction 3d point cloud zero-shot inference 6-dof tracking language-to-4d modeling zero-shot tracking 6-dof pose tracking

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024