ARXSA: A General Negative Feedback Control Theory in Vision-Language Models

Zeyu Zhang; Tianqi Chen; Yuki Todo

2025 EMNLP EMNLP 2025

ARXSA: A General Negative Feedback Control Theory in Vision-Language Models

Abstract

AbstractThe Transformer model has been increasingly applied across various domains, driven by the self-attention mechanism, which offers robust data processing capabilities and has substantially contributed to the advancement of the model. In the self-attention mechanism, three core matrices from the same data batch are computed together to determine correlations between input elements. Drawing inspiration from the efficiency and stability conferred by negative feedback structures in predictive control systems, the concept of vertical training was introduced to integrate data from multiple batches. Accordingly, this paper proposes an autoregressive with exogenous inputs (ARX) approach for the self-attention mechanism, transforming the Encoder block into a negative feedback predictive control system. A network architecture based on this method is also proposed, enabling the autoregressive with exogenous inputs for self-attention to transmit data from batches at previous time points. The effectiveness of the proposed approach is validated through comparative experimental evaluations.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zeyu Zhang , Tianqi Chen , Yuki Todo

Topics

Artificial Intelligence > Core AI > Multimodal Learning Deep Learning > Architectures > Transformers

Keywords

self-attention mechanism autoregressive model vision-language model negative feedback

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025