Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

Thong Nguyen; Yi Bin; Junbin Xiao; Leigang Qu; Yicong Li; Jay Zhangjie Wu; Cong-Duy Nguyen; See-Kiong Ng; Anh Tuan Luu

ACL 2024

Abstract

Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us. There has been a lot of interest in creating video-language understanding systems with human-like senses, since a video-language pair can mimic both our linguistic medium and our visual environment with temporal dynamics. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on these challenges, we summarize the corresponding methods from model architecture, model training, and data perspectives. We also conduct a performance comparison among the methods and discuss promising directions for future research.

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Core Methods > Representation Learning; Machine Learning > Learning Types > Self-Supervised Learning; Deep Learning > Architectures > Transformers; Computer Vision > Core AI > Multimodal Learning; Deep Learning > Learning Types > Multi-Modal Learning; Artificial Intelligence > Core AI > Multi-Modal Learning; Deep Learning > Models > Multi-Modal Learning

Keywords

representation learning; temporal dynamics; self-supervised learning; multimodal learning; model architecture; video understanding; model training; vision-language model; video-language understanding; vision language
