ART: Attention-Regularized Transformers for Multi-Modal Robustness
Abstract
Transformers have become the standard in Natural Language Processing (NLP) and Computer Vision (CV) due to their strong performance, yet they remain highly sensitive to small adversarial input perturbations, such as synonym swaps in text or pixel-level changes in images. These perturbations can mislead predictions, and existing defenses are often domain-specific or lack formal robustness guarantees. We propose the Attention-Regularized Transformer (ART), a framework that enhances robustness across modalities. ART builds on the Attention Sensitivity Tensor (AST), which quantifies the effect of input perturbations on attention outputs. By incorporating an AST-based regularizer into training, ART encourages stable attention maps under adversarial perturbations in both text and image tasks. We evaluate ART on IMDB, QNLI, CIFAR-10, CIFAR-100, and Imagenette. Results show consistent robustness gains over strong baselines such as FreeLB and DSRM: up to +36.9% robust accuracy on IMDB and QNLI, and +5–25% on image benchmarks across multiple Vision Transformer (ViT) architectures, while maintaining or improving clean accuracy. ART is also efficient: it trains over 10× faster than adversarial training methods on text and requires only 1.25× the cost of standard training on images, compared to 1.5–5.5× for recent robust ViTs. Code is available at [https://github.com/cliclab-um6p/ART](https://github.com/cliclab-um6p/ART).