Temporally Grounding Instructional Diagrams in Unconstrained Videos

Jiahao Zhang; Frederic Z. Zhang; Cristian Rodriguez; Yizhak Ben-Shabat; Anoop Cherian; Stephen Gould

WACV 2025

Abstract

We study the challenging problem of simultaneously localizing a sequence of queries, given in the form of instructional diagrams, in a video. This requires understanding not only the individual queries but also their interrelationships. However, most existing methods ground one query at a time, ignoring the inherent structure among queries, such as their general mutual exclusiveness and their temporal order. Consequently, the predicted timespans of different step diagrams may overlap considerably or violate the temporal order, which harms accuracy. In this paper, we tackle this issue by simultaneously grounding a sequence of step diagrams. Specifically, we propose composite queries, constructed by exhaustively pairing the visual content features of the step diagrams with a fixed number of learnable positional embeddings. Our insight is that, through self-attention, composite queries carrying different content features suppress each other, which reduces timespan overlaps in predictions, while cross-attention corrects temporal misalignment via joint content and position guidance. We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and on the YouCook2 benchmark for grounding natural language queries, significantly outperforming existing methods while grounding multiple queries simultaneously.
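The exhaustive pairing behind composite queries can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_composite_queries`, the feature dimension, the counts of step diagrams and positional embeddings, and the random placeholder vectors are all assumptions made for illustration. The abstract does not say how content and position are combined inside the model, so each pair is simply kept as a tuple here.

```python
from itertools import product
import random


def build_composite_queries(content_features, positional_embeddings):
    """Pair every content feature with every positional embedding.

    Returns a list of (step_index, position_index, content, position) tuples,
    ordered by step first and position second.
    """
    return [
        (i, k, content, position)
        for (i, content), (k, position) in product(
            enumerate(content_features), enumerate(positional_embeddings)
        )
    ]


random.seed(0)
dim = 4            # hypothetical feature dimension
num_steps = 3      # hypothetical number of step diagrams
num_positions = 5  # hypothetical fixed number of positional embeddings

# Placeholder vectors standing in for diagram features and learned embeddings.
content_features = [
    [random.gauss(0, 1) for _ in range(dim)] for _ in range(num_steps)
]
positional_embeddings = [
    [random.gauss(0, 1) for _ in range(dim)] for _ in range(num_positions)
]

queries = build_composite_queries(content_features, positional_embeddings)
print(len(queries))  # 15, i.e. num_steps * num_positions
```

Under these assumptions, every step diagram shares the same set of positional embeddings, so the number of composite queries grows as the product of the two counts.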

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Self-Supervised Learning
Deep Learning > Architectures > Transformers
Computer Vision > Processing > Video Understanding

Keywords

semantic segmentation, multimodal learning, video understanding, temporal grounding, video grounding, temporal localization, cross modal, composite queries, multi-query grounding, instructional diagram
