Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

Zhengfei Xu; Sijia Zhao; Yanchao Hao; Xiaolong Liu; Lili Li; Yuyang Yin; Bo Li; Xi Chen; Xin Xin

2025 AAAI AAAI 2025

Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

Abstract

Abstract Visual Entity Linking (VEL) is a crucial task for achieving fine-grained visual understanding, matching objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs like clicks or bounding boxes offer a more convenient alternative. Therefore, we propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks from visual inputs to refer to objects, supplementing reference methods for VEL. To facilitate research on this task, we have constructed the MaskOVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. This dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, which will advance visual understanding towards fine-grained. Moreover, as pixel masks correspond to semantic regions in an image, we enhance previous patch-interacted attention with region-interacted attention by a visual semantic tokenization approach. Manual evaluation results indicate that the reverse annotation framework achieved a 94.8% annotation success rate. Experimental results show that models trained on this dataset improved accuracy by 18 points compared to zero-shot models. Additionally, the semantic tokenization method achieved a 5-point accuracy improvement over the trained baseline.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning

🧭 Keyword Pioneer — fine-grained visual understanding

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy

Authors

Zhengfei Xu , Sijia Zhao , Yanchao Hao , Xiaolong Liu , Lili Li , Yuyang Yin , Bo Li , Xi Chen , Xin Xin

Topics

Artificial Intelligence > Core AI > Interpretability Artificial Intelligence > Core AI > Multimodal Learning Computer Vision > Generation > Image Captioning Artificial Intelligence > Core AI > Knowledge Graph Deep Learning > Models > Vision-Language Models

Keywords

knowledge base pixel-level annotation visual semantic fine-grained visual understanding visual entity linking semantic tokenization region-interacted attention

Download PDF

Related papers

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving 2025

APIRL: Deep Reinforcement Learning for REST API Fuzzing 2025

Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation 2025

3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly Detection 2025

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics 2025