Papers
Understanding the Visual Projection Space of Multimodal LLMs
Sungheon Jeong, Yoojeong Song, Hyungjoon Kim
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
Yan-Bo Lin, Kevin Lin, Zhengyuan Yang et al.
VOCAL: Visual Odometry via ContrAstive Learning
Chi-Yao Huang, Zeel Bhatt, Yezhou Yang
CLIP's Visual Embedding Projector is a Few-shot Cornucopia
Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc et al.
Visual Detector Compression via Location-Aware Discriminant Analysis
Qizhen Lan, Jung Im Choi, Qing Tian
HiMix: Hierarchical Visual-Textual Mixing Network for Lesion Segmentation
Soojin Hwang, Jaeyoon Sim, Won Hwa Kim
See, Record, Do: Automated Generation of UI Workflows from Tutorial Videos
Adam Beauchaine, Craig Shue
ChartQA-X: Generating Explanations for Visual Chart Reasoning
Shamanthak Hegde, Pooyan Fazli, Hasti Seifi
A Computational Approach to Visual Metonymy
Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang
Progressive Visual Refinement for Multi-modal Summarization
Ye Xiong, Hidetaka Kamigaito, Soichiro Murakami et al.
DRIVINGVQA: A Dataset for Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios
Charles Corbière, Simon Roburin, Syrielle Montariol et al.
Seeing Words Differently: Visual Embeddings for Robust English-Arabic Machine Translation
Mahdi Alshaikh Saleh, Irfan Ahmad
Hebrew Diacritics Restoration using Visual Representation
Yair Elboher, Yuval Pinter
AnimatedLLM: Explaining LLMs with Interactive Visualizations
Zdeněk Kasner, Ondrej Dusek
Open-World Object Counting in Videos
Niki Amini-Naieni, Andrew Zisserman
AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs
Boyu Chang, Qi Wang, Xi Guo et al.
VMChill: A Dataset for Fine-Grained Visual-Musical Synergy
Xiaowei Chi, Zeyue Tian, Jialiang Chen et al.
Primary Visual Cortex Inspired Point Cloud Analysis Framework
Jisheng Dang, Delin Deng, Bimei Wang et al.
SCAN: Self-Calibrated AutoregressioN for High-Quality Visual Generation
Zhanzhou Feng, Qingpei Guo, Jingdong Chen et al.
AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection
Bin-Bin Gao, Yue Zhou, Jiangtao Yan et al.
Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?
Jia Li, Wenjie Zhao, Ziru Huang et al.
Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning
Liqin Luo, Guangyao Chen, Xiawu Zheng et al.
Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
Fangyuan Mao, Aiming Hao, Jintao Chen et al.
Next Patch Prediction for AutoRegressive Visual Generation
Yatian Pang, Peng Jin, Shuo Yang et al.
Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
Jiaqi Tang, Jianmin Chen, Wei Wei et al.