Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection

Khurram Azeem Hashmi; Talha Uddin Sheikh; Didier Stricker; Muhammad Zeshan Afzal

WACV 2025

Abstract

The primary challenge in Video Object Detection (VOD) is effectively exploiting temporal information to enhance object representations. Traditional strategies, such as aggregating region proposals, often suffer from feature variance due to the inclusion of background information. We introduce a novel instance mask-based feature aggregation approach that significantly refines this process and deepens the understanding of object dynamics across video frames. We present FAIM, a new VOD method that enhances temporal Feature Aggregation by leveraging Instance Mask features. In particular, we propose the lightweight Instance Feature Extraction Module (IFEM) to learn instance mask features and the Temporal Instance Classification Aggregation Module (TICAM) to aggregate instance mask and classification features across video frames. Using YOLOX as a base detector, FAIM achieves 87.9% mAP on the ImageNet VID dataset at 33 FPS on a single 2080Ti GPU, setting a new benchmark for the speed-accuracy trade-off. Additional experiments on multiple datasets validate that our approach is robust, method-agnostic, and effective in multi-object tracking, demonstrating its broader applicability to video understanding tasks.
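The contrast the abstract draws between box-level and mask-level features can be illustrated with a toy example. The sketch below is not the paper's IFEM or TICAM; it is a minimal NumPy illustration under assumed names (`box_pool`, `mask_pool`, and `aggregate` are hypothetical helpers) of why pooling over a box mixes in background, and how per-frame instance vectors might be combined with similarity-based weights.

```python
import numpy as np

def box_pool(feat, box):
    """Average a (C, H, W) feature map over a box (x0, y0, x1, y1).

    Background pixels inside the box contribute to the result.
    """
    x0, y0, x1, y1 = box
    return feat[:, y0:y1, x0:x1].mean(axis=(1, 2))

def mask_pool(feat, mask):
    """Average a (C, H, W) feature map over the pixels of a binary (H, W) mask."""
    weights = mask.astype(feat.dtype)
    total = weights.sum()
    if total == 0:
        # Empty mask: return a zero vector rather than dividing by zero.
        return np.zeros(feat.shape[0], dtype=feat.dtype)
    return (feat * weights).sum(axis=(1, 2)) / total

def aggregate(query, refs):
    """Softmax-weighted average of reference vectors by cosine similarity to query.

    Assumption: a generic similarity-weighted aggregation, not the paper's TICAM.
    """
    refs = np.stack(refs)
    qn = query / (np.linalg.norm(query) + 1e-8)
    rn = refs / (np.linalg.norm(refs, axis=1, keepdims=True) + 1e-8)
    sim = rn @ qn
    w = np.exp(sim - sim.max())
    w /= w.sum()
    return w @ refs

# Toy frame: background features are 0, object features are 1.
C, H, W = 4, 8, 8
feat = np.zeros((C, H, W))
mask = np.zeros((H, W), dtype=bool)
mask[2:6, 3:5] = True          # object covers 8 pixels
feat[:, mask] = 1.0
box = (2, 2, 6, 6)             # enclosing box covers 16 pixels

box_vec = box_pool(feat, box)      # diluted by background: 0.5 per channel
mask_vec = mask_pool(feat, mask)   # object only: 1.0 per channel
fused = aggregate(mask_vec, [mask_vec, mask_vec])
print(box_vec, mask_vec, fused)
```

In this toy setting the box-pooled vector is halved by the background inside the box, while the mask-pooled vector keeps the object value, which is the kind of variance reduction the abstract motivates.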

Topics

Computer Vision > Analysis > Object Detection
Computer Vision > Analysis > Video Understanding
Deep Learning > Learning Types > Self-Supervised Learning

Keywords

temporal modeling, object tracking, instance segmentation, feature aggregation, video object detection
