Multi-Modal Learning

115 directly classified papers

Papers

VisFinEval: A Scenario-Driven Chinese Multimodal Benchmark for Holistic Financial Understanding EMNLP 2025

Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs EMNLP 2025

VoiceBBQ: Investigating Effect of Content and Acoustics in Social Bias of Spoken Language Model EMNLP 2025

DELOC: Document Element Localizer EMNLP 2025

EduVidQA: Generating and Evaluating Long-form Answers to Student Questions based on Lecture Videos EMNLP 2025

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model EMNLP 2025

Self-Improvement in Multimodal Large Language Models: A Survey EMNLP 2025

CONSTRUCTURE: Benchmarking CONcept STRUCTUre REasoning for Multimodal Large Language Models EMNLP 2024

Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models EMNLP 2024

Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation EMNLP 2024

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever EMNLP 2024

MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs NIPS 2024

Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models NIPS 2024

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models NIPS 2024

Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing? ACL 2024

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation ACL 2024

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives ACL 2024

MemeMQA: Multimodal Question Answering for Memes via Rationale-Based Inferencing ACL 2024

SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models ACL 2024

Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models NIPS 2024

Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding NIPS 2024

MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts EMNLP 2024

OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer EMNLP 2024

Kiss up, Kick down: Exploring Behavioral Changes in Multi-modal Large Language Models with Assigned Visual Personas EMNLP 2024

UNICORN: A Unified Causal Video-Oriented Language-Modeling Framework for Temporal Video-Language Tasks EMNLP 2024