Scaling Properties of Diffusion Models For Perceptual Tasks

Rahul Ravishankar; Zeeshan Patel; Jathushan Rajasegaran; Jitendra Malik

CVPR 2025

Abstract

In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm not only for generation but also for visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve performance competitive with state-of-the-art methods while using significantly less data and compute.

Topics

Deep Learning > Models > Diffusion Models
Computer Vision > Analysis > Depth Estimation
Computer Vision > Processing > Image Segmentation
Computer Vision > Processing > Image Processing
Computer Vision > Processing > Motion Estimation
Computer Vision > Processing > Depth Estimation

Keywords

semantic segmentation; image segmentation; visual perception; depth estimation; image-to-image translation; optical flow; diffusion model; compute scaling; amodal segmentation; perceptual task
