MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

Aman Khullar; Udit Arora

EMNLP 2020


Abstract

This paper presents MAST, a new model for Multimodal Abstractive Text Summarization that uses information from all three modalities of a multimodal video: text, audio, and video. Prior work on multimodal abstractive text summarization used only the text and video modalities. We examine the usefulness and the challenges of deriving information from the audio modality, and present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges by letting the model pay more attention to the text modality. On the How2 dataset for multimodal language understanding, MAST outperforms the current state-of-the-art (video-text) model by 2.51 points in Content F1 score and 1.00 points in ROUGE-L score.
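The abstract names hierarchical attention over three modalities but does not spell out the mechanism. The following is a minimal sketch of a generic two-level attention scheme, not the authors' architecture: the first level attends within each modality's encoder states, and the second level attends over the resulting per-modality context vectors. The function names, the use of plain dot-product scoring, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, states):
    """Dot-product attention: returns (weights, weighted sum of states)."""
    weights = softmax([dot(query, s) for s in states])
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return weights, context


def hierarchical_attention(query, modalities):
    """Two-level attention (illustrative, hypothetical helper).

    modalities maps a modality name to a list of encoder state vectors,
    all sharing the dimensionality of the decoder query.
    """
    names = list(modalities)
    # Level 1: one context vector per modality.
    contexts = [attend(query, modalities[name])[1] for name in names]
    # Level 2: attention over the per-modality context vectors.
    weights, fused = attend(query, contexts)
    return dict(zip(names, weights)), fused


# Toy inputs (made up): the text states align best with the query.
query = [1.0, 0.0]
modalities = {
    "text": [[1.0, 0.0], [0.5, 0.5]],
    "audio": [[0.0, 1.0], [0.1, 0.2]],
    "video": [[0.2, 0.1]],
}
modality_weights, fused = hierarchical_attention(query, modalities)
```

In this toy setup the second-level weights form a distribution over the three modalities, and the text modality receives the largest share because its context vector scores highest against the query. A trained model would learn projection matrices for the query and for each modality rather than using raw dot products.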



Topics

Machine Learning > Learning Types > Self-Supervised Learning
Natural Language Processing > Generation > Summarization
Natural Language Processing > Applications > Summarization
Computer Vision > Core AI > Multimodal Learning
Computer Vision > Analysis > Video Understanding
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Multimodal Learning
Computer Vision > Applications > Video Understanding

Keywords

video understanding; sequence-to-sequence model; hierarchical attention; audio modality; multimodal abstractive summarization; multimodal language understanding; trimodal learning; trimodal hierarchical attention; video-text summarization

