Vision-Language Models

685 directly classified papers

Papers

Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow AAAI 2025

DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models WACV 2025

Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP AAAI 2025

Unified Coding for Both Human Perception and Generalized Machine Analytics with CLIP Supervision AAAI 2025

CADReview: Automatically Reviewing CAD Programs with Error Detection and Correction ACL 2025

Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation CVPR 2025

IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis AAAI 2025

Exploring the Better Multimodal Synergy Strategy for Vision-Language Models AAAI 2025

CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models AAAI 2025

Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection AAAI 2025

Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking AAAI 2025

Position-Aware Guided Point Cloud Completion with CLIP Model AAAI 2025

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning AAAI 2025

Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents CVPR 2025

KPL: Training-Free Medical Knowledge Mining of Vision-Language Models AAAI 2025

Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning AAAI 2025

LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating ACL 2025

LOMA: Language-assisted Semantic Occupancy Network via Triplane Mamba AAAI 2025

Explanation Bottleneck Models AAAI 2025

PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures AAAI 2025

Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning ACL 2025

Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion WACV 2025

Rethinking High-speed Image Reconstruction Framework with Spike Camera AAAI 2025

Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection AAAI 2025

Defining and Evaluating Visual Language Models’ Basic Spatial Abilities: A Perspective from Psychometrics ACL 2025