Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches

Tsutomu Hirao; Naoki Kobayashi; Hidetaka Kamigaito; Manabu Okumura; Akisato Kimura

EMNLP 2024

Abstract

This paper tackles a new task: discourse parsing for videos, inspired by text discourse parsing based on Rhetorical Structure Theory (RST). The task aims to construct an RST tree for a video to represent its storyline and illustrate the event relationships. We first construct a benchmark dataset by identifying events with their time spans, providing corresponding captions, and constructing RST trees with events as leaves. We then evaluate baseline approaches to video RST parsing: the ‘parsing after captioning’ framework and parsing via visual features. The results show that a parser using gold captions performed the best, while parsers relying on generated captions performed the worst; a parser using visual features provided intermediate performance. However, we observed that parsing via visual features could be improved by pre-training it with video captioning designed to produce a coherent video story. Furthermore, we demonstrated that RST trees obtained from videos contribute to multimodal summarization consisting of keyframes with texts.
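To make the structure described in the abstract concrete, the following is a minimal sketch of an RST tree whose leaves are video events with time spans and captions. All names here (`Event`, `Relation`, `satellite_depth`, `select_events`), the relation labels, and the example captions are hypothetical and chosen for illustration; they are not taken from the paper's dataset or code. The selection heuristic, which prefers events reached through fewer satellite edges, is borrowed from common practice in text RST summarization and is only an assumption about how a tree could guide keyframe selection, not the authors' method.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Event:
    """A leaf: one video event with its time span (in seconds) and a caption."""
    start: float
    end: float
    caption: str


@dataclass
class Relation:
    """An internal node joining adjacent spans under a rhetorical relation."""
    label: str               # illustrative labels such as "Sequence"
    children: List["Node"]
    nuclei: List[bool]       # nuclei[i] is True when children[i] is a nucleus


Node = Union[Event, Relation]


def leaves(node: Node) -> List[Event]:
    """Return the events under a node in left-to-right (temporal) order."""
    if isinstance(node, Event):
        return [node]
    out: List[Event] = []
    for child in node.children:
        out.extend(leaves(child))
    return out


def time_span(node: Node) -> Tuple[float, float]:
    """Time span covered by a subtree: first leaf start to last leaf end."""
    events = leaves(node)
    return (events[0].start, events[-1].end)


def satellite_depth(node: Node, depth: int = 0) -> List[Tuple[Event, int]]:
    """Pair each event with the number of satellite edges on its root path."""
    if isinstance(node, Event):
        return [(node, depth)]
    out: List[Tuple[Event, int]] = []
    for child, is_nucleus in zip(node.children, node.nuclei):
        out.extend(satellite_depth(child, depth if is_nucleus else depth + 1))
    return out


def select_events(root: Node, k: int) -> List[Event]:
    """Pick k events with the fewest satellite edges, returned in time order."""
    ranked = sorted(satellite_depth(root), key=lambda pair: pair[1])  # stable
    chosen = [event for event, _ in ranked[:k]]
    return sorted(chosen, key=lambda event: event.start)


# A toy four-event video (captions and times are made up).
e1 = Event(0.0, 5.0, "A man enters the kitchen")
e2 = Event(5.0, 12.0, "He chops vegetables")
e3 = Event(12.0, 20.0, "He cooks them in a pan")
e4 = Event(20.0, 25.0, "He serves the dish")

tree = Relation(
    "Sequence",
    [
        Relation("Elaboration", [e1, e2], [True, False]),
        Relation("Sequence", [e3, e4], [True, True]),
    ],
    [True, True],
)

summary = select_events(tree, 3)
```

In this toy tree, `e2` is the only satellite, so a three-event summary keeps `e1`, `e3`, and `e4`; each selected event's time span could then be used to pick a keyframe, with its caption as the accompanying text.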

Topics

Computer Vision > Analysis > Scene Understanding; Computer Vision > Generation > Image Captioning; Computer Vision > Processing > Video Understanding; Natural Language Processing > Generation > Summarization; Natural Language Processing > Applications > Summarization; Computer Vision > Analysis > Video Understanding; Deep Learning > Learning Types > Multi-Modal Learning; Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

video captioning; video understanding; visual feature; event extraction; discourse parsing; rhetorical structure theory; multimodal summarization; video event detection; video discourse parsing; video storyline analysis
