@BENCH: Benchmarking Vision-Language Models for Human-Centered Assistive Technology

Xin Jiang; Junwei Zheng; Ruiping Liu; Jiahang Li; Jiaming Zhang; Sven Matthiesen; Rainer Stiefelhagen

WACV 2025

Abstract

As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists capable of performing multiple tasks simultaneously. However, benchmarking VLMs for ATs remains under-explored. To bridge this gap, we first create a novel AT benchmark (@BENCH). Guided by a pre-design user study with PVIs, our benchmark includes the five most crucial vision-language tasks: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA). In addition, we propose a novel AT model (@MODEL) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs. Our framework exhibits outstanding performance across tasks by integrating multi-modal information, and it offers PVIs more comprehensive assistance. Extensive experiments prove the effectiveness and generalizability of our framework.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Analysis > Depth Estimation
Computer Vision > Generation > Image Captioning
Computer Vision > Processing > Image Segmentation

Keywords

visual question answering, depth estimation, vision-language model, panoptic segmentation, optical character recognition, assistive technology
