Entity-aware and Motion-aware Transformers for Language-driven Action Localization

Shuo Yang; Xinxiao Wu

2022 IJCAI IJCAI 2022

Entity-aware and Motion-aware Transformers for Language-driven Action Localization

Abstract

Language-driven action localization in videos is a challenging task that involves not only visual-linguistic matching but also action boundary prediction. Recent progress has been achieved through aligning language queries to video segments, but estimating precise boundaries is still under-explored. In this paper, we propose entity-aware and motion-aware Transformers that progressively localize actions in videos by first coarsely locating clips with entity queries and then finely predicting exact boundaries in a shrunken temporal region with motion queries. The entity-aware Transformer incorporates the textual entities into visual representation learning via cross-modal and cross-frame attentions to facilitate attending action-related video clips. The motion-aware Transformer captures fine-grained motion changes at multiple temporal scales via integrating long short-term memory into the self-attention module to further improve the precision of action boundary prediction. Extensive experiments on the Charades-STA and TACoS datasets demonstrate that our method achieves better performance than existing methods.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Deep Learning, Machine Learning, Natural Language Processing

Authors

Shuo Yang , Xinxiao Wu

Topics

Deep Learning > Architectures > Transformers Computer Vision > Analysis > Action Recognition

Keywords

action localization

Download PDF

Related papers

Better Collective Decisions via Uncertainty Reduction 2022

Mixed Strategies for Security Games with General Defending Requirements 2022

Achieving Envy-Freeness with Limited Subsidies under Dichotomous Valuations 2022

Distortion in Voting with Top-t Preferences 2022

Let’s Agree to Agree: Targeting Consensus for Incomplete Preferences through Majority Dynamics 2022