RUST: Latent Neural Scene Representations From Unposed Imagery

Mehdi S. M. Sajjadi; Aravindh Mahendran; Thomas Kipf; Etienne Pot; Daniel Duckworth; Mario Lucic; Klaus Greff

CVPR 2023

Abstract

Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have had tremendous impact and have been applied across a variety of domains. One of the major remaining challenges in this space is training a single model that provides latent representations which generalize effectively beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and requires accurately posed ground-truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding, which the decoder then uses for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves quality similar to that of methods which have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations.
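The data flow described in the abstract (a scene encoder over unposed inputs, a Pose Encoder that peeks at the target image, and a decoder conditioned on the resulting latent pose) can be illustrated with a toy sketch. This is not the authors' implementation: the random linear maps stand in for learned transformer components, and all names and sizes (`encode_scene`, `encode_pose`, `decode`, `POSE_DIM`, and so on) are hypothetical. Conditioning the pose encoder on the scene latent and keeping the latent pose low-dimensional are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration; none come from the source.
NUM_VIEWS, IMG_DIM, SCENE_DIM, POSE_DIM = 3, 48, 16, 8

# Random linear maps stand in for the learned transformer components.
W_scene = rng.normal(size=(IMG_DIM, SCENE_DIM)) * 0.1
W_pose = rng.normal(size=(IMG_DIM + SCENE_DIM, POSE_DIM)) * 0.1
W_dec = rng.normal(size=(SCENE_DIM + POSE_DIM, IMG_DIM)) * 0.1


def encode_scene(input_views):
    """Map unposed input views (num_views, IMG_DIM) to one scene latent."""
    return np.tanh(input_views @ W_scene).mean(axis=0)


def encode_pose(target_image, scene_latent):
    """Pose encoder stand-in: peeks at the target image, emits a latent pose.

    Assumption: it also sees the scene latent, and POSE_DIM is small so the
    latent pose cannot simply carry the whole target image.
    """
    return np.tanh(np.concatenate([target_image, scene_latent]) @ W_pose)


def decode(scene_latent, latent_pose):
    """Decoder stand-in: renders a view from scene latent and latent pose."""
    return np.concatenate([scene_latent, latent_pose]) @ W_dec


input_views = rng.normal(size=(NUM_VIEWS, IMG_DIM))  # flattened RGB, no poses
target_image = rng.normal(size=IMG_DIM)

scene_latent = encode_scene(input_views)
latent_pose = encode_pose(target_image, scene_latent)
prediction = decode(scene_latent, latent_pose)

# The training signal is a reconstruction loss on RGB alone; no camera pose
# appears anywhere in the pipeline.
loss = float(np.mean((prediction - target_image) ** 2))
print(scene_latent.shape, latent_pose.shape, prediction.shape)
```

In this sketch the only supervision is the target image itself, which mirrors the "trained on RGB images alone" claim; everything else about the real architecture is left unspecified here.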

Topics

Machine Learning > Core Methods > Representation Learning; Deep Learning > Architectures > Transformers; Computer Vision > Analysis > 3D Vision; Computer Vision > Generation > Image Generation; Deep Learning > Models > Transformers; Computer Vision > Generation > 3D Generation; Computer Vision > Processing > 3D Vision

Keywords

transformer architecture; 3D reconstruction; pose estimation; novel view synthesis; camera pose; neural scene representation; 3D scene; latent embedding; pose-free training
