Matrix3D: Large Photogrammetry Model All-in-One

Yuanxun Lu; Jingyang Zhang; Tian Fang; Jean-Daniel Nahmias; Yanghai Tsin; Long Quan; Xun Cao; Yao Yao; Shiwei Li

2025 CVPR CVPR 2025

Matrix3D: Large Photogrammetry Model All-in-One

Abstract

We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis using just the same model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs, thus significantly increases the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Project page: https://nju-3dv.github.io/projects/matrix3d.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yuanxun Lu , Jingyang Zhang , Tian Fang , Jean-Daniel Nahmias , Yanghai Tsin , Long Quan , Xun Cao , Yao Yao , Shiwei Li

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Self-Supervised Learning Computer Vision > Analysis > 3D Vision

Keywords

pose estimation multi-modal learning novel view synthesis diffusion transformer mask learning depth prediction

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025