What’s Different between Visual Question Answering for Machine “Understanding” Versus for Accessibility?

Yang Trista Cao; Kyle Seelman; Kyungjun Lee; Hal Daume III

IJCNLP 2022

Abstract

In visual question answering (VQA), a machine must answer a question given an associated image. Recently, accessibility researchers have explored whether VQA can be deployed in a real-world setting where users with visual impairments learn about their environment by capturing their visual surroundings and asking questions. However, most existing benchmark datasets for VQA focus on machine “understanding”, and it remains unclear how progress on those datasets corresponds to improvements in this real-world use case. We aim to answer this question by evaluating a variety of VQA models on a machine “understanding” dataset (VQA-v2) and an accessibility dataset (VizWiz) and examining the discrepancies between them. Based on our findings, we discuss opportunities and challenges in VQA for accessibility and suggest directions for future work.

Topics

Artificial Intelligence > Core AI > Human-AI Interaction
Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Analysis > Scene Understanding
Computer Vision > Analysis > Visual Question Answering
Natural Language Processing > Applications > Visual Question Answering

Keywords

benchmark evaluation; visual question answering; multimodal learning; image understanding; visual impairment; machine understanding
