DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

Thong Nguyen; Xiaobao Wu; Xinshuai Dong; Cong-Duy Nguyen; See-Kiong Ng; Anh Luu

EMNLP 2023

Abstract

Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.
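The abstract names an exponential moving average with a damping factor but does not state the recurrence. As a rough illustration only, the sketch below assumes the commonly used damped form y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1} with y_0 = 0, applied to a scalar sequence. The function name `damped_ema` and the fixed scalar `alpha` and `delta` are hypothetical; in the paper the damping factor is learnable, and the actual DemaFormer layer may differ.

```python
def damped_ema(xs, alpha, delta):
    """Damped EMA over a scalar sequence (illustrative assumption, not the paper's code).

    Recurrence: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}, with y_0 = 0.
    alpha weights the current input; delta damps how much of the past is kept.
    Setting delta = 1 recovers the standard exponential moving average.
    """
    if not (0.0 < alpha < 1.0):
        raise ValueError("alpha must lie in (0, 1)")
    if not (0.0 < delta <= 1.0):
        raise ValueError("delta must lie in (0, 1]")

    y = 0.0          # running state, y_0 = 0
    outputs = []
    for x in xs:
        y = alpha * x + (1.0 - alpha * delta) * y
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    features = [1.0, 1.0, 0.0, 0.0]  # toy per-moment feature values
    print(damped_ema(features, alpha=0.5, delta=1.0))  # standard EMA
    print(damped_ema(features, alpha=0.5, delta=0.5))  # weaker decay of the past
```

Under this assumed form, a smaller `delta` makes the coefficient on the previous state larger, so earlier inputs persist longer in the output.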

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Models > Generative Models
Computer Vision > Generation > Video Generation
Machine Learning > Optimization & Theory > Probabilistic Modeling
Computer Vision > Analysis > Video Understanding
Artificial Intelligence > Core AI > Natural Language Processing
Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

attention mechanism; energy-based model; video moment localization; exponential moving average; temporal language grounding; energy-based modeling; moment-query distribution
