The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique Like Photographers

Daiqing Qi; Handong Zhao; Jing Shi; Simon Jenni; Yifei Fan; Franck Dernoncourt; Scott Cohen; Sheng Li

2025 CVPR CVPR 2025

The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique Like Photographers

Abstract

Photographer, curator, and former director of photography at the Museum of Modern Art (MoMA), John Szarkowski remarked in *William Eggleston's Guide*, "While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky." Szarkowski insightfully revealed a notable gap between general and aesthetic visual understanding: while the former emphasizes identifying factual elements in an image (the sky), the latter transcends mere object identification, viewing it instead as an aesthetic component--a pure expanse of blue, valued purely as a color block in visual aesthetics. Such distinctions between general visual understanding (detection, localization, etc.) and aesthetic perception (color, lighting, composition, etc.) pose a significant challenge for existing Multimodal Large Language Models (MLLMs) in comprehending image aesthetics, which is increasingly needed in real-world applications, from image recommendation and enhancement to generation. To fundamentally advance the aesthetic understanding of MLLMs, we introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, distinguished by its large scale, expertise, and diversity. Additionally, we propose a new model, PhotoEye, an MLLM featuring a language-guided multi-view vision fusion mechanism for understanding image aesthetics from multiple perspectives. Finally, we introduce PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. Our model demonstrates significant advantages over both open-source and commercial models on existing benchmarks and PhotoBench.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning

🧭 Keyword Pioneer — aesthetic understanding

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Daiqing Qi , Handong Zhao , Jing Shi , Simon Jenni , Yifei Fan , Franck Dernoncourt , Scott Cohen , Sheng Li

Topics

Artificial Intelligence > Core AI > Multimodal Learning Computer Vision > Analysis > Scene Understanding Deep Learning > Models > Large Language Models Deep Learning > Models > Multimodal Learning

Keywords

multimodal learning multimodal large language model visual understanding large language model photographic composition aesthetic understanding visual aesthetics image critique image aesthetics photography critique

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025