WikiScenes with Descriptions: Aligning Paragraphs and Sentences with Images in Wikipedia Articles

Özge Alaçam; Ronja Utescher; Hannes Grönner; Judith Sieker; Sina Zarrieß

2024 NAACL NAACL 2024

WikiScenes with Descriptions: Aligning Paragraphs and Sentences with Images in Wikipedia Articles

Abstract

AbstractResearch in Language & Vision rarely uses naturally occurring multimodal documents as Wikipedia articles, since they feature complex image-text relations and implicit image-text alignments. In this paper, we provide one of the first datasets that provides ground-truth annotations of image-text alignments in multi-paragraph multi-image articles. The dataset can be used to study phenomena of visual language grounding in longer documents and assess retrieval capabilities of language models trained on, e.g., captioning data. Our analyses show that there are systematic linguistic differences between the image captions and descriptive sentences from the article’s text and that intra-document retrieval is a challenging task for state-of-the-art models in L&V (CLIP, VILT, MCSE).

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Özge Alaçam , Ronja Utescher , Hannes Grönner , Judith Sieker , Sina Zarrieß

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Applications > Information Retrieval Computer Vision > Core AI > Multimodal Learning

Keywords

information retrieval multimodal learning visual language grounding cross-modal retrieval multimodal document image-text alignment

Download PDF

Related papers

Working Alliance Transformer for Psychotherapy Dialogue Classification 2024

Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences 2024

Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study 2024

TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation 2024

Extractive Summarization with Text Generator 2024