UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines

Chen Tang; Xinzhu Ma; Encheng Su; Xiufeng Song; Xiaohong Liu; Wei-Hong Li; Lei Bai; Wanli Ouyang; Xiangyu Yue

CVPR 2025

Abstract

Traditional spatiotemporal models generally rely on task-specific architectures, whose domain-specific design requirements limit their generalizability and scalability across diverse tasks. In this paper, we introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling, inspired by recent foundation models and their two-stage pretraining-then-adaptation paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, which is then followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve learning capability across domains, our framework employs a rank-adaptive mixture-of-experts adaptation that uses fractional interpolation to relax the discrete variables so that they can be optimized in continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at https://github.com/1hunters/UniSTD.
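The abstract does not spell out how fractional interpolation relaxes the discrete variables. The sketch below is one plausible reading, not the authors' implementation: it assumes the discrete variable is the rank of each low-rank (LoRA-style) expert, and that a continuous rank r keeps the first floor(r) components fully and weights the next component by the fractional part of r. All names (fractional_rank_mask, low_rank_delta, mixture_delta) and the softmax gate are hypothetical and chosen only for illustration.

```python
import numpy as np


def fractional_rank_mask(rank: float, max_rank: int) -> np.ndarray:
    """Soft mask over rank components (assumed relaxation, not from the paper).

    Component i gets weight clip(rank - i, 0, 1): the first floor(rank)
    components are fully on and the next one carries the fractional part.
    """
    return np.clip(rank - np.arange(max_rank), 0.0, 1.0)


def low_rank_delta(down: np.ndarray, up: np.ndarray, rank: float) -> np.ndarray:
    """Weight update of one low-rank expert at a continuous rank.

    down has shape (max_rank, d_in) and up has shape (d_out, max_rank).
    """
    mask = fractional_rank_mask(rank, down.shape[0])
    # Scaling the columns of `up` by the mask switches rank components on or off.
    return (up * mask) @ down


def mixture_delta(experts, ranks, gate_logits) -> np.ndarray:
    """Softmax-gated sum of expert updates, each expert with its own rank."""
    gate = np.exp(gate_logits - np.max(gate_logits))
    gate = gate / gate.sum()
    return sum(
        w * low_rank_delta(down, up, r)
        for w, (down, up), r in zip(gate, experts, ranks)
    )
```

Under these assumptions, a rank of 2.5 with four available components gives the mask [1, 1, 0.5, 0], so the update is an equal blend of the rank-2 and rank-3 updates. Because the mask is piecewise linear in the rank, the rank can in principle be tuned by gradient-based optimization alongside the other parameters.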

Topics

Machine Learning > Core Methods > Representation Learning; Deep Learning > Architectures > Transformers; Deep Learning > Techniques > Pretraining; Machine Learning > Learning Types > Multi-Task Learning; Deep Learning > Models > Foundation Models; Deep Learning > Learning Types > Multi-Task Learning

Keywords

transformer architecture; representation learning; multi-task learning; domain adaptation; spatio-temporal learning; foundation model; mixture of experts; temporal module
