Visual Question Answering

70 directly classified papers

Papers

Few-shot Personalized Scanpath Prediction CVPR 2025

Target Scanpath-Guided 360-Degree Image Enhancement AAAI 2025

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? ACL 2025

Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding ACL 2025

Momentum Pseudo-Labeling for Weakly Supervised Phrase Grounding AAAI 2025

EyEar: Learning Audio Synchronized Human Gaze Trajectory Based on Physics-Informed Dynamics AAAI 2025

Analyzing the Sensitivity of Vision Language Models in Visual Question Answering ACL 2025

NLKI: A Lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks EMNLP 2025

ChartEdit: How Far Are MLLMs From Automating Chart Analysis? Evaluating MLLMs’ Capability via Chart Editing ACL 2025

DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning EMNLP 2025

Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint EMNLP 2025

Multi-Granular Multimodal Clue Fusion for Meme Understanding AAAI 2025

Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA AAAI 2024

Plot Twist: Multimodal Models Don’t Comprehend Simple Chart Details EMNLP 2024

Exploiting the Social-Like Prior in Transformer for Visual Reasoning AAAI 2024

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models EMNLP 2024

JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images NeurIPS 2024

UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models EMNLP 2024

TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering EMNLP 2024

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions NeurIPS 2024

Towards Artwork Explanation in Large-scale Vision Language Models ACL 2024

MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding CVPR 2024

FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension EMNLP 2024

ReMI: A Dataset for Reasoning with Multiple Images NeurIPS 2024

ECHo: A Visio-Linguistic Dataset for Event Causality Inference via Human-Centric Reasoning EMNLP 2023