Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders

Wang Lin; Qingsong Wang; Yueying Feng; Shulei Wang; Tao Jin; Zhou Zhao; Fei Wu; Chang Yao; Jingyuan Chen

2025 CVPR CVPR 2025

Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders

Abstract

Large language models (LLMs) have significantly enhanced cross-modal understanding capabilities by integrating visual encoders with textual embeddings, giving rise to multimodal large language models (MLLMs). However, these models struggle with non-natural images such as geometric and charts, particularly in fields like education and finance. Despite efforts to collect datasets and fine-tune the MLLMs, the gap with natural image understanding is still evident, and the cost of collecting large and diverse non-natural image datasets is high. To address this, we analyzed the limitations of transformer-based vision encoders(ViT) within existing MLLMs from a frequency perspective. Studies have shown that ViT models are less effective at capturing high-frequency information, impairing their ability to capture elements like points, lines, and angles in non-natural images. In response, we introduced FM-ViT, a frequency-modulated vision encoder that utilizes Fourier decomposition to extract high and low frequency components from self-attention features and re-weight them during tuning to non-natural images. In addition, we combine the features of CNN models with FM-ViT and propose EDGE, an MLLM with enhanced graphical encoders tailored for understanding non-natural images. Extensive experiments have confirmed the effectiveness of our FM-ViT and EDGE in 4 types.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — non-natural image

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Wang Lin , Qingsong Wang , Yueying Feng , Shulei Wang , Tao Jin , Zhou Zhao , Fei Wu , Chang Yao , Jingyuan Chen

Topics

Artificial Intelligence > Core AI > Foundation Models Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Computer Vision > Core AI > Computer Vision Deep Learning > Learning Types > Multi-Modal Learning

Keywords

vision transformer fourier decomposition multimodal large language model frequency domain image understanding vision encoder frequency analysis high-frequency information non-natural image

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025