Towards Long-window Anchoring in Vision-Language Model Distillation

Haoyi Zhou; Shuo Li; Tianyu Chen; Qi Song; Chonghan Gao; Jianxin Li

AAAI 2026

Abstract

While large vision-language models (VLMs) demonstrate impressive long-context understanding, their prevalent smaller variants fail at linguistics-photography alignment because of their limited window size. We find that knowledge distillation improves the student's capability and is complementary to Rotary Position Embeddings (RoPE) at certain window sizes (anchored from the large models). Building on this insight, we propose LAid, which explicitly targets the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2× longer effective context windows than baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work provides practical techniques for building more efficient long-context VLMs and offers theoretical insights into how positional understanding emerges and transfers during distillation.
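To make component (1) more concrete, the following is a minimal sketch of what a progressive distance-weighted attention matching loss could look like. It is not the authors' implementation: the abstract gives no formula, so the weighting form `(1 + normalized distance) ** power`, the `progress` schedule variable, the `max_power` parameter, and both function names are assumptions chosen for illustration. Component (2), the RoPE gain modulation, is not sketched here.

```python
import numpy as np


def distance_weights(seq_len, progress, max_power=2.0):
    """Hypothetical weights that grow with the position difference |i - j|.

    `progress` in [0, 1] is an assumed training-progress variable: at 0 all
    pairs are weighted equally, and as it rises long-range pairs count more.
    """
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :]).astype(np.float64)
    norm = dist / max(seq_len - 1, 1)      # scale distances to [0, 1]
    weights = (1.0 + norm) ** (progress * max_power)
    return weights / weights.mean()        # keep the overall loss scale stable


def attention_matching_loss(student_attn, teacher_attn, progress):
    """Weighted mean squared error between student and teacher attention maps.

    Both maps are assumed to have shape (..., seq_len, seq_len).
    """
    weights = distance_weights(student_attn.shape[-1], progress)
    return float(np.mean(weights * (student_attn - teacher_attn) ** 2))
```

With `progress = 0` the loss reduces to a plain mean squared error over attention maps; increasing `progress` during training shifts the penalty toward mismatches between distant positions, which is the behavior the abstract describes.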

Authors

Haoyi Zhou, Shuo Li, Tianyu Chen, Qi Song, Chonghan Gao, Jianxin Li

Topics

Machine Learning > Application Areas > Knowledge Distillation
Deep Learning > Architectures > Transformers

Keywords

attention mechanism, knowledge distillation, vision-language model, rotary position embedding, long-context understanding
