OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities

Anton Razzhigaev; Maxim Kurkin; Elizaveta Goncharova; Irina Abdullaeva; Anastasia Lysenko; Alexander Panchenko; Andrey Kuznetsov; Denis Dimitrov

EMNLP 2024

Abstract

We introduce OmniDialog, the first comprehensive trimodal benchmark grounded in a knowledge graph (Wikidata), designed to evaluate how well Large Multimodal Models (LMMs) generalize across three modalities: text, visual, and audio. The benchmark consists of more than 4,000 dialogues, each averaging 10 turns, all annotated and cross-validated by human experts. The dialogues are designed to prevent shortcut learning by incorporating varied formats and misleading or irrelevant multimodal cues. We also evaluate both multimodal and unimodal models to gain insight into how they process the modality inputs introduced during a conversation.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Generation > Dialogue Systems

Keywords

knowledge graph; large multimodal model; zero-shot transfer; dialogue system; cross-modal generalization
