Multi-modal Knowledge Distillation-based Human Trajectory Forecasting

Jaewoo Jeong; Seohee Lee; Daehee Park; Giwon Lee; Kuk-Jin Yoon

2025 CVPR CVPR 2025

Multi-modal Knowledge Distillation-based Human Trajectory Forecasting

Abstract

Pedestrian trajectory forecasting is crucial in various applications such as autonomous driving and mobile robot navigation. In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) to enhance prediction accuracy. Indeed, we find that textual descriptions play a crucial role in integrating additional modalities into a unified understanding. However, online extraction of text requires the use of VLM, which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework: a student model with limited modality is distilled from a teacher model trained with full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model using only trajectory or human pose as a sole supplement. In doing so, we separately distill the core locomotion insights from intra-agent multi-modality and inter-agent interaction. Our generalizable framework is validated with two state-of-the-art models across three datasets on both ego-view (JRDB, SIT) and BEV-view (ETH/UCY) setups, utilizing both annotated and VLM-generated text captions. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations, improving up to 13%. The code is available at github.com/Jaewoo97/KDTF.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jaewoo Jeong , Seohee Lee , Daehee Park , Giwon Lee , Kuk-Jin Yoon

Topics

Artificial Intelligence > Core AI > Multimodal Learning Artificial Intelligence > Core AI > Trajectory Prediction Machine Learning > Application Areas > Knowledge Distillation Deep Learning > Learning Types > Multi-Modal Learning Deep Learning > Learning Types > Knowledge Distillation Artificial Intelligence > Core AI > Multi-Modal Learning Computer Vision > Analysis > Trajectory Prediction

Keywords

knowledge distillation autonomous driving trajectory prediction multi-modal learning human pose estimation trajectory forecasting pedestrian trajectory

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025