Papers
Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
Wenbin Wang, Liang Ding, Minyan Zeng et al.
ICM-Assistant: Instruction-tuning Multimodal Large Language Models for Rule-based Explainable Image Content Moderation
Mengyang Wu, Yuzhi Zhao, Jialun Cao et al.
Attention-Driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models Without Fine-Tuning
Hai-Ming Xu, Qi Chen, Lei Wang et al.
Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models
Yifang Xu, Yunzhuo Sun, Benxiang Zhai et al.
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model
Xu Yuan, Li Zhou, Zenghui Sun et al.
Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models
Guosheng Zhang, Keyao Wang, Haixiao Yue et al.
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Jiaxin Zhang, Wentao Yang, Songxuan Lai et al.
Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation
Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan et al.
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Baichuan Zhou, Haote Yang, Dairong Chen et al.
ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming
Jiedong Zhuang, Lu Lu, Ming Dai et al.
Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models
Jean Park, Kuk Jin Jang, Basam Alasaly et al.
PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures
Shreya Shukla, Nakul Sharma, Manish Gupta et al.
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
Xiyao Wang, Yuhang Zhou, Xiaoyu Liu et al.
Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
Daechul Ahn, Yura Choi, Youngjae Yu et al.
Unified Hallucination Detection for Multimodal Large Language Models
Xiang Chen, Chenxi Wang, Yida Xue et al.
FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model
Yebin Lee, Imseong Park, Myungjoo Kang
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA
Yue Fan, Jing Gu, Kaiwen Zhou et al.
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
Hongliang He, Wenlin Yao, Kaixin Ma et al.
MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception
Yuhao Wang, Yusheng Liao, Heyang Liu et al.
CODIS: Benchmarking Context-dependent Visual Comprehension for Multimodal Large Language Models
Fuwen Luo, Chi Chen, Zihao Wan et al.
Model Composition for Multimodal Large Language Models
Chi Chen, Yiyang Du, Zheng Fang et al.
Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks
Fakhraddin Alwajih, El Moatez Billah Nagoudi, Gagan Bhatia et al.
PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain
Liang Chen, Yichi Zhang, Shuhuai Ren et al.
Can Large Multimodal Models Uncover Deep Semantics Behind Images?
Yixin Yang, Zheng Li, Qingxiu Dong et al.
MLeVLM: Improve Multi-level Progressive Capabilities based on Multimodal Large Language Model for Medical Visual Question Answering
Dexuan Xu, Yanyuan Chen, Jieyi Wang et al.