Deep Submodular Optimization and LLM for Multimodal Content Extraction and Automatic Poster Generation from Long Document

Vijay Jaisankar; Sambaran Bandyopadhyay; Kalp Vyas; Varre Suman Chaitanya; Shwetha Somasundaram

AAAI 2025

Abstract

A poster generated from a long input document can be viewed as a one-page, easy-to-read multimodal (text and images) summary presented on a well-designed template. Automatically transforming a long document into a poster is a challenging task that has received little study. It involves summarizing the content of the input document, followed by template generation and harmonization. In this work, we propose a novel deep submodular function that can be trained on ground-truth summaries to extract multimodal content from the document and that explicitly ensures good coverage, diversity, and alignment of text and images. We then use an LLM-based paraphraser and propose generating a template with various design aspects conditioned on the input content. We show the merits of our approach through extensive automated and human evaluations.
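
To make the coverage and diversity idea concrete, the following is a minimal sketch of greedy content selection under a hand-written submodular objective. It is not the deep submodular function proposed in the paper: the objective here is a fixed facility-location coverage term plus a cluster-based diversity term, nothing is trained, and text-image alignment is left out. The names `coverage`, `diversity`, `greedy_select`, `sim`, `clusters`, `budget`, and `lam` are all hypothetical and chosen for illustration only.

```python
import math


def coverage(selected, sim):
    """Facility-location coverage: every item is credited with its
    best similarity to any selected item."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def diversity(selected, clusters):
    """Sum of square roots of per-cluster counts: a second pick from the
    same cluster adds less than a first pick from a new cluster."""
    counts = {}
    for j in selected:
        counts[clusters[j]] = counts.get(clusters[j], 0) + 1
    return sum(math.sqrt(c) for c in counts.values())


def greedy_select(sim, clusters, budget, lam=1.0):
    """Greedily add the item with the largest marginal gain until the
    budget is reached (assumed objective: coverage + lam * diversity)."""

    def score(subset):
        return coverage(subset, sim) + lam * diversity(subset, clusters)

    selected = []
    while len(selected) < budget:
        base = score(selected)
        best, best_gain = None, 0.0
        for j in range(len(sim)):
            if j in selected:
                continue
            gain = score(selected + [j]) - base
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:  # no remaining item improves the objective
            break
        selected.append(best)
    return selected


# Toy input: items 0 and 1 are near-duplicates, as are items 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
clusters = [0, 0, 1, 1]
print(greedy_select(sim, clusters, budget=2))
```

On the toy input, the second pick comes from the other cluster, because adding a near-duplicate of an already selected item yields a much smaller marginal gain. For monotone submodular objectives under a cardinality budget, this greedy procedure is known to achieve a (1 - 1/e) approximation of the optimum, which is one reason submodular objectives are a common choice for extractive summarization.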

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Optimization & Theory > Optimization
Machine Learning > Application Areas > Efficient Computing
Natural Language Processing > Generation > Summarization
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

submodular optimization; document summarization; multimodal learning; image-text alignment; content extraction; large language model; template generation
