Hyper-Depth: Hypergraph-based Multi-Scale Representation Fusion for Monocular Depth Estimation

Lin Bie; Siqi Li; Yifan Feng; Yue Gao

2025 ICCV ICCV 2025

Hyper-Depth: Hypergraph-based Multi-Scale Representation Fusion for Monocular Depth Estimation

Abstract

Monocular depth estimation (MDE) is a fundamental problem in computer vision with wide-ranging applications in various downstream tasks. While multi-scale features are perceptually critical for MDE, existing transformer-based methods have yet to leverage them explicitly. To address this limitation, we propose a hypergraph-based multi-scale representation fusion framework, Hyper-Depth.The proposed Hyper-Depth incorporates two key components: a Semantic Consistency Enhancement (SCE) module and a Geometric Consistency Constraint (GCC) module. The SCE module, designed based on hypergraph convolution, aggregates global information and enhances the representation of multi-scale patch features. Meanwhile, the GCC module provides geometric guidance to reduce over-fitting errors caused by excessive reliance on local features. In addition, we introduce a Correlation-based Conditional Random Fields (C-CRFs) module as the decoder to filter correlated patches and compute attention weights more effectively.Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches across all evaluation metrics on the KITTI and NYU-Depth-v2 datasets, achieving improvements of 6.21% and 3.32% on the main metric RMSE, respectively. Furthermore, zero-shot evaluations on the nuScenes and SUN-RGBD datasets validate the generalizability of our approach.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Lin Bie , Siqi Li , Yifan Feng , Yue Gao

Topics

Deep Learning > Architectures > Transformers Deep Learning > Architectures > Graph Neural Networks Computer Vision > Analysis > Depth Estimation Computer Vision > Processing > Semantic Segmentation

Keywords

transformer architecture monocular depth estimation semantic consistency multi-scale feature geometric consistency conditional random field hypergraph convolution

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025