MSRTrack: LLM-Powered Object Tracking with Motion and Semantic Reasoning

Tong Shen; Di Wang; José M. F. Moura

WACV 2026

Abstract

State-of-the-art object trackers primarily model appearance relations between the image template and the search region with Siamese networks. However, this well-established approach has limited ability to leverage the motion and semantic cues of the target object, leading to increased errors in challenging scenarios such as drastic appearance changes and similar-looking distractors. To address these weaknesses, we propose a novel tracking framework with Motion and Semantic Reasoning (MSRTrack), which integrates short-term motion modeling and distinctive semantic features for robust tracking across diverse conditions. Powered by vision large language models (VLLMs) and the Segment Anything Model 2 (SAM2), MSRTrack identifies unique semantic attributes of the target, exploits motion cues across consecutive frames, and complements appearance-based trackers with strong semantic and dynamic reasoning capabilities. Unlike previous vision-language tracking (VLT) methods that rely on broad captioning, MSRTrack automatically focuses on a concise set of key semantic attributes of the target, substantially improving lost-target recovery and distractor rejection. MSRTrack achieves state-of-the-art performance across multiple tracking benchmarks, with improvements of 2.2% on the LaSOT dataset, 9.5% on the VastTrack dataset, and 1.4% on the TNL2K dataset.
