DIUSum: Dynamic Image Utilization for Multimodal Summarization

Min Xiao; Junnan Zhu; Feifei Zhai; Yu Zhou; Chengqing Zong

AAAI 2024

Abstract

Existing multimodal summarization approaches focus on fusing image features during encoding and ignore the fact that different summaries have different needs for images. However, both intuitively and empirically, not all images improve summary quality. We therefore propose a novel Dynamic Image Utilization framework for multimodal Summarization (DIUSum) that selects and utilizes valuable images for summarization. First, to predict whether an image helps produce a high-quality summary, we propose an image selector that scores the usefulness of each image. Second, to utilize the multimodal information dynamically, we incorporate hard and soft guidance from the image selector. Under this guidance, the image information is plugged into the decoder to generate a summary. Experimental results show that DIUSum outperforms multiple strong baselines and achieves state-of-the-art results on two public multimodal summarization datasets. Further analysis demonstrates that the image selector reflects how much the images improve summary quality.
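As a rough illustration of the selector-plus-guidance idea described in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the scoring function, the threshold, or the way guidance is applied, so everything here is an assumption made for illustration: the dot-product scorer, the 0.5 threshold, and the names `score_images` and `guide_images` are all hypothetical. Hard guidance is modeled as dropping low-scoring images, and soft guidance as scaling the remaining image features by their scores before they would be passed to a decoder.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score_images(
    image_feats: Sequence[Sequence[float]],
    text_feat: Sequence[float],
) -> List[float]:
    """Hypothetical image selector.

    Scores each image by the sigmoid of its dot product with a text
    feature vector. This scorer is an assumption for illustration; the
    abstract only says the selector scores the usefulness of each image.
    """
    return [
        sigmoid(sum(a * b for a, b in zip(img, text_feat)))
        for img in image_feats
    ]


def guide_images(
    image_feats: Sequence[Sequence[float]],
    scores: Sequence[float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[List[float]]:
    """Apply hard, then soft guidance to image features.

    Hard guidance: discard images whose score is below the threshold.
    Soft guidance: scale each kept image's features by its score.
    """
    guided = []
    for feat, score in zip(image_feats, scores):
        if score < threshold:
            continue  # hard guidance: this image is not used at all
        guided.append([score * v for v in feat])  # soft guidance
    return guided
```

For example, with two images whose features are `[1.0, 0.0]` and `[-1.0, 0.0]` and a text feature of `[2.0, 0.0]`, the first image scores above 0.5 and is kept in down-weighted form, while the second scores below 0.5 and is dropped.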

Authors

Min Xiao, Junnan Zhu, Feifei Zhai, Yu Zhou, Chengqing Zong

Topics

Machine Learning > Application Areas > Data Augmentation
Deep Learning > Techniques > Model Architecture
Computer Vision > Generation > Image Captioning
Natural Language Processing > Applications > Summarization
Deep Learning > Learning Types > Representation Learning
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

image feature, multimodal summarization, summary generation, image selection, dynamic image utilization, image guidance, hard and soft guidance, image selector
