Injection Without Distortion: Geometrically Constrained Knowledge Enhancement for Vision-Language Models

Zhongze Wu; Xiu Su; Feng Yang; Shan You; Jun Long; Yueyi Luo

AAAI 2026

Abstract

Vision-Language Models (VLMs) are widely used in tasks such as open-vocabulary object detection and zero-shot classification, owing to their powerful generalization. However, recent research reveals that VLMs exhibit significant performance instability when asked to recognize concepts at varying granularities (e.g., "animal" vs. "dog"). Prevailing methods inject external knowledge from Large Language Models (LLMs), but this unconstrained injection distorts the VLM's inherent hierarchical orthogonal geometry, leading to performance collapse on general concepts. To address this, we introduce GeCoin, a Geometrically Constrained framework that safely enhances existing VLMs with external knowledge for improved hierarchical understanding, without additional training. By projecting knowledge into the null-space of a query concept's feature space, GeCoin mathematically guarantees the preservation of general knowledge while integrating specialized information. Extensive experiments across large-scale benchmarks, diverse VLMs, and knowledge from various LLMs (e.g., GPT-3.5, Claude-3, Gemini-Pro) show that GeCoin boosts performance by an average of 3.9% over the strongest baseline and, crucially, eradicates performance collapse on general concepts.
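
The null-space projection mentioned in the abstract can be illustrated with a small sketch. The abstract does not give GeCoin's exact formulation, so what follows is only textbook linear algebra under stated assumptions: features are plain vectors, the query concept's feature space is spanned by a few given vectors, and the knowledge vector is added with a weight `alpha`. All names (`project_to_null_space`, `inject`, `alpha`, the toy vectors) are hypothetical and do not come from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: Sequence[Sequence[float]], tol: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(knowledge: Sequence[float],
                          query_feats: Sequence[Sequence[float]]) -> Vector:
    """Remove from `knowledge` every component lying in span(query_feats).

    The result is orthogonal to each query feature, i.e. it lies in the
    null space of the matrix whose rows are the query features.
    """
    residual = list(knowledge)
    for b in orthonormal_basis(query_feats):
        c = dot(residual, b)
        residual = [r - c * bi for r, bi in zip(residual, b)]
    return residual


def inject(query: Sequence[float], knowledge: Sequence[float],
           query_feats: Sequence[Sequence[float]], alpha: float = 1.0) -> Vector:
    """Add only the null-space part of `knowledge` to `query` (assumed update rule)."""
    safe = project_to_null_space(knowledge, query_feats)
    return [q + alpha * s for q, s in zip(query, safe)]


# Toy example: a general concept and a knowledge vector that overlaps with it.
animal = [1.0, 0.0, 0.0]          # hypothetical "animal" text feature
dog_knowledge = [0.6, 0.8, 0.0]   # hypothetical LLM-derived knowledge feature

safe = project_to_null_space(dog_knowledge, [animal])
enhanced = inject(animal, dog_knowledge, [animal])

# The projected knowledge has no component along the general concept,
# so the enhanced feature scores the same along "animal" as before.
print(dot(safe, animal))
print(dot(enhanced, animal), dot(animal, animal))
```

In this toy setting, the inner product of the enhanced feature with the general concept is unchanged by the injection, which is the kind of preservation property the abstract describes; how GeCoin constructs the feature space and combines the projected knowledge in practice is specified in the paper, not here.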

Topics

Artificial Intelligence > Core AI > Interpretability
Artificial Intelligence > Core AI > Multimodal Learning

Keywords

feature geometry, vision-language model, zero-shot classification, knowledge injection, open-vocabulary object detection, hierarchical understanding
