Vision-Language Models
685 directly classified papers
Papers per year
2015: 1
2016: 1
2017: 3
2018: 1
2019: 7
2020: 12
2021: 26
2022: 57
2023: 94
2024: 235
2025: 248
Papers
Show and Guide: Instructional-Plan Grounded Vision and Language Model
EMNLP 2024
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models
EMNLP 2024
A Simple LLM Framework for Long-Range Video Question-Answering
EMNLP 2024
MM-ChatAlign: A Novel Multimodal Reasoning Framework based on Large Language Models for Entity Alignment
EMNLP 2024
Granular Entity Mapper: Advancing Fine-grained Multimodal Named Entity Recognition and Grounding
EMNLP 2024
M5 – A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks
EMNLP 2024
Reference-free Hallucination Detection for Large Vision-Language Models
EMNLP 2024
UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition
EMNLP 2024
Is GPT-4V(ision) All You Need for Automating Academic Data Visualization? Exploring Vision-Language Models’ Capability in Reproducing Academic Charts
EMNLP 2024
AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
EMNLP 2024
VPL: Visual Proxy Learning Framework for Zero-Shot Medical Image Diagnosis
EMNLP 2024
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
EMNLP 2024
BiasDora: Exploring Hidden Biased Associations in Vision-Language Models
EMNLP 2024
Multimodal Misinformation Detection by Learning from Synthetic Data with Multimodal LLMs
EMNLP 2024
Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models
EMNLP 2024
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
EMNLP 2024
A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models
EMNLP 2024
Why do LLaVA Vision-Language Models Reply to Images in English?
EMNLP 2024
MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?
EMNLP 2024
Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP
EMNLP 2024
Unraveling the Truth: Do VLMs really Understand Charts? A Deep Dive into Consistency and Robustness
EMNLP 2024
Unified Generative and Discriminative Training for Multi-modal Large Language Models
NeurIPS 2024
AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
NeurIPS 2024
AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer
NeurIPS 2024
FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding
NeurIPS 2024