What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

Dongqi Liu; Chenxi Whitehouse; Xi Yu; Louis Mahon; Rohit Saxena; Zheng Zhao; Yifu Qiu; Mirella Lapata; Vera Demberg

ACL 2025

Abstract

Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.
