UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

Zhenyu Chen; Ronghang Hu; Xinlei Chen; Matthias Nießner; Angel X. Chang

2023 ICCV ICCV 2023

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

Abstract

Performing 3D dense captioning and visual grounding requires a common and shared understanding of the underlying multimodal relationships. However, despite some previous attempts on connecting these two related tasks with highly task-specific neural modules, it remains understudied how to explicitly depict their shared nature to learn them simultaneously. In this work, we propose UniT3D, a simple yet effective fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. UniT3D enables learning a strong multimodal representation across the two tasks through a supervised joint pre-training scheme with bidirectional and seq-to-seq objectives. With a generic architecture design, UniT3D allows expanding the pre-training scope to more various training sources such as the synthesized data from 2D prior knowledge to benefit 3D vision-language tasks. Extensive experiments and analysis demonstrate that UniT3D obtains significant gains for 3D dense captioning and visual grounding.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zhenyu Chen , Ronghang Hu , Xinlei Chen , Matthias Nießner , Angel X. Chang

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Computer Vision > Analysis > 3D Vision

Keywords

transformer architecture multimodal learning joint training dense captioning 3d visual grounding

Download PDF

Related papers

PVT++: A Simple End-to-End Latency-Aware Visual Tracking Framework 2023

Periodically Exchange Teacher-Student for Source-Free Object Detection 2023

Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations 2023

Minimal Solutions to Uncalibrated Two-view Geometry with Known Epipoles 2023

3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation 2023