ImageSet2Text: Describing Sets of Images Through Text

Piera Riccio; Francesco Galati; Kajetan Schweighofer; Noa Garcia; Nuria M Oliver

2026 AAAI AAAI 2026

ImageSet2Text: Describing Sets of Images Through Text

Abstract

Abstract In the era of large-scale visual data, understanding collections of images is a challenging yet important task. To this end, we introduce ImageSet2Text, a novel method to automatically generate natural language descriptions of image sets. Based on large language models, visual-question answering chains, an external lexical graph, and CLIP-based verification, ImageSet2Text iteratively extracts key concepts from image subsets and organizes them into a structured concept graph. We conduct extensive experiments evaluating the quality of the generated descriptions in terms of accuracy, completeness, and user satisfaction. We also examine the method's behavior through ablation studies, scalability assessments, and failure analyses. Results demonstrate that ImageSet2Text combines data-driven AI and symbolic representations to reliably summarize large image collections for a wide range of applications.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — set summarization

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Piera Riccio , Francesco Galati , Kajetan Schweighofer , Noa Garcia , Nuria M Oliver

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Natural Language Processing > Generation > Summarization

Keywords

visual question answering concept graphs large language model set summarization image set understanding clip verification

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026