Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes

Tom Fischer; Xiaojie Zhang; Eddy Ilg

ICCV 2025

Abstract

Recognizing objects in images is a fundamental problem in computer vision. Although detecting objects in 2D images is common, many applications require determining their pose in 3D space. Traditional category-level methods rely on RGB-D inputs, which may not always be available, or employ two-stage approaches that use separate models and representations for detection and pose estimation. For the first time, we introduce a unified model that integrates detection and pose estimation into a single framework for RGB images by leveraging neural mesh models with learned features and multi-model RANSAC. Our approach achieves state-of-the-art results for RGB category-level pose estimation on REAL275, improving on the current state-of-the-art by 22.9% averaged across all scale-agnostic metrics. Finally, we demonstrate that our unified method exhibits greater robustness compared to single-stage baselines.

Topics

Computer Vision > Analysis > 3D Vision
Computer Vision > Analysis > Object Detection
Computer Vision > Analysis > Semantic Segmentation
Computer Vision > Domain-Specific > Autonomous Driving

Keywords

pose estimation, object detection, 3D vision, neural mesh model, category-level recognition, 3D prototype, multi-model RANSAC, RGB image
