Scaling Action Detection: AdaTAD++ with Transformer-Enhanced Temporal-Spatial Adaptation

Tanay Agrawal; Abid Ali; Antitza Dantcheva; francois bremond

2025 ICCV ICCV 2025

Scaling Action Detection: AdaTAD++ with Transformer-Enhanced Temporal-Spatial Adaptation

Abstract

Temporal Action Detection (TAD) is essential for analyzing long-form videos by identifying and segmenting actions within untrimmed sequences. While recent innovations like Temporal Informative Adapters (TIA) have improved resolution, memory constraints still limit large video processing. To address this, we introduce AdaTAD++, an enhanced framework that decouples temporal and spatial processing within adapters, organizing them into independently trainable modules. Our novel two-step training strategy first optimizes for high temporal and low spatial resolution, then vice versa, allowing the model to utilize both high spatial and temporal resolutions during inference, while maintaining training efficiency. Additionally, we incorporate a more sophisticated temporal module capable of capturing long-range dependencies more effectively than previous methods. Experiments on benchmark datasets, including ActivityNet-1.3, THUMOS14, and EPIC-Kitchens 100, demonstrate that AdaTAD++ achieves state-of-the-art performance. We also explore various adapter configurations, discussing their trade-offs regarding resource constraints and performance, providing valuable insights into their optimal application.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Tanay Agrawal , Abid Ali , Antitza Dantcheva , francois bremond

Topics

Deep Learning > Architectures > Transformers Deep Learning > Techniques > Model Architecture Computer Vision > Analysis > Action Recognition Computer Vision > Processing > Video Understanding Machine Learning > Application Areas > Model Compression Artificial Intelligence > Core AI > Efficient Computing

Keywords

action recognition video understanding transformer encoder temporal adaptation spatial adaptation temporal action detection parameter efficient transfer learning

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025