Dynamic-Static Collaboration for Unsupervised Domain Adaptive Video-Based Visible-Infrared Person Re-Identification
Abstract
Video-based visible-infrared person re-identification (VVI-ReID) aims to match pedestrian sequences across modalities for all-day surveillance. While supervised methods have shown progress, their dependence on large-scale cross-modal annotations limits scalability. We investigate the task of unsupervised domain adaptation for VVI-ReID (UDA-VVI-ReID), where a model trained on a labeled source domain is adapted to an unlabeled target domain. Directly extending existing image-based unsupervised VI-ReID methods to video scenarios by simply averaging frame-level features is suboptimal, as this naive strategy neglects the rich temporal dynamics in video data and leads to unreliable pseudo-labels due to occlusion-induced noise. To overcome these limitations, we propose a Dynamic-Static Collaboration (DSC) framework that explicitly leverages the complementary strengths of motion and appearance cues. The Dynamic-Static Label Unification (DSLU) module refines pseudo-labels by validating the consistency between static and dynamic predictions. Based on these labels, the Dynamic-Static Joint Learning (DSJL) module performs neighbor-aware contrastive learning in both feature spaces, promoting robust representation learning under cross-modal and temporal variations. Experiments on HITSZ-VCM and BUPTCampus show that DSC sets a strong baseline for this new task, enabling robust cross-modal video ReID without target labels.
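To make the idea of validating static predictions against dynamic ones concrete, the following is a minimal sketch, not the DSLU module itself. It assumes each target tracklet has already received one cluster label from appearance (static) features and one from motion (dynamic) features, and it keeps a pseudo-label only when the two clusters largely overlap. The function name `consistent_mask`, the Jaccard criterion, the `min_overlap` threshold, and the use of `-1` for clustering outliers are all illustrative assumptions rather than details taken from the paper.

```python
from collections import defaultdict


def consistent_mask(static_labels, dynamic_labels, min_overlap=0.5):
    """Flag samples whose static and dynamic clusters largely agree.

    Hypothetical rule (not taken from the paper): a sample is kept when the
    Jaccard overlap between its static cluster and its dynamic cluster is at
    least `min_overlap`. A label of -1 marks a clustering outlier.
    """
    if len(static_labels) != len(dynamic_labels):
        raise ValueError("label lists must have the same length")

    # Collect the member indices of every cluster in each label space.
    static_members = defaultdict(set)
    dynamic_members = defaultdict(set)
    for idx, (s, d) in enumerate(zip(static_labels, dynamic_labels)):
        static_members[s].add(idx)
        dynamic_members[d].add(idx)

    mask = []
    for s, d in zip(static_labels, dynamic_labels):
        if s == -1 or d == -1:
            # Outliers in either space are treated as unreliable.
            mask.append(False)
            continue
        a, b = static_members[s], dynamic_members[d]
        mask.append(len(a & b) / len(a | b) >= min_overlap)
    return mask


# Toy labels for six tracklets (illustrative values only).
static = [0, 0, 0, 1, 1, -1]
dynamic = [5, 5, 7, 7, 7, 7]
mask = consistent_mask(static, dynamic)
print(mask)  # [True, True, False, True, True, False]
```

In this toy case, tracklet 2 is dropped because its static cluster and dynamic cluster share only one member, and tracklet 5 is dropped because it is a static outlier. A real pipeline would feed the retained labels to the contrastive objective; how the paper resolves disagreements is not specified in the abstract.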