Vision-Language Models

685 directly classified papers

Papers

3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer CVPR 2025

CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination AAAI 2025

Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation CVPR 2025

Exploring the Better Multimodal Synergy Strategy for Vision-Language Models AAAI 2025

Grounded, or a Good Guesser? A Per-Question Balanced Dataset to Separate Blind from Grounded Models for Embodied Question Answering ACL 2025

BiMAC: Bidirectional Multimodal Alignment in Contrastive Learning AAAI 2025

Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data ACL 2025

A-VL: Adaptive Attention for Large Vision-Language Models AAAI 2025

CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models CVPR 2025

LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating ACL 2025

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage ACL 2025

Visual Evidence Prompting Mitigates Hallucinations in Large Vision-Language Models ACL 2025

Sharper and Faster mean Better: Towards More Efficient Vision-Language Model for Hour-scale Long Video Understanding ACL 2025

Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning ACL 2025

Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models ACL 2025

VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues ACL 2025

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models ACL 2025

SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification ACL 2025

Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning ACL 2025

Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts ACL 2025

Aligning VLM Assistants with Personalized Situated Cognition ACL 2025

CADReview: Automatically Reviewing CAD Programs with Error Detection and Correction ACL 2025

Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder ACL 2025

HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models ACL 2025

Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference CVPR 2025