Vision-Language Models

685 directly classified papers

Papers

Show and Guide: Instructional-Plan Grounded Vision and Language Model EMNLP 2024

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models EMNLP 2024

A Simple LLM Framework for Long-Range Video Question-Answering EMNLP 2024

MM-ChatAlign: A Novel Multimodal Reasoning Framework based on Large Language Models for Entity Alignment EMNLP 2024

Granular Entity Mapper: Advancing Fine-grained Multimodal Named Entity Recognition and Grounding EMNLP 2024

M5 – A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks EMNLP 2024

Reference-free Hallucination Detection for Large Vision-Language Models EMNLP 2024

UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition EMNLP 2024

Is GPT-4V(ision) All You Need for Automating Academic Data Visualization? Exploring Vision-Language Models’ Capability in Reproducing Academic Charts EMNLP 2024

AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models EMNLP 2024

VPL: Visual Proxy Learning Framework for Zero-Shot Medical Image Diagnosis EMNLP 2024

MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding EMNLP 2024

BiasDora: Exploring Hidden Biased Associations in Vision-Language Models EMNLP 2024

Multimodal Misinformation Detection by Learning from Synthetic Data with Multimodal LLMs EMNLP 2024

Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models EMNLP 2024

AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding EMNLP 2024

A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models EMNLP 2024

Why do LLaVA Vision-Language Models Reply to Images in English? EMNLP 2024

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans? EMNLP 2024

Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP EMNLP 2024

Unraveling the Truth: Do VLMs really Understand Charts? A Deep Dive into Consistency and Robustness EMNLP 2024

Unified Generative and Discriminative Training for Multi-modal Large Language Models NIPS 2024

AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation NIPS 2024

AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer NIPS 2024

FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding NIPS 2024