SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)

Atharva Naik; Yash Parag Butala; Navaneethan Vaikunthan; Raghav Kapoor

2024 AAAI AAAI 2024

SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)

Abstract

Abstract When humans are posed with a difficult problem, they often approach it by identifying key skills, honing them, and finally effectively combining them. We propose a novel method and apply it for the VizWiz VQA task to predict the visual skills needed to answer a question, and leverage expert modules to produce intermediary outputs and fuse them in a skill-aware manner. Unlike prior works in visual question-answering (VQA) that use intermediate outputs such as detected objects and Optical Character Recognition (OCR), our approach explicitly guides the model with a skill embedding on what to focus on. While our results show that using skill-aware fusion outperforms skill-unaware models for only a subset of questions, we believe our results provide interesting directions for future work. We also release our code, model, and illustrative demonstrations for future research purposes.

🧭 Keyword Pioneer — skill fusion

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

Authors

Atharva Naik , Yash Parag Butala , Navaneethan Vaikunthan , Raghav Kapoor

Topics

Artificial Intelligence > Core AI > Multimodal Learning Artificial Intelligence > Core AI > Planning

Keywords

visual question answering multimodal fusion visual skill skill fusion

Download PDF

Related papers

Goal Alignment: Re-analyzing Value Alignment Problems Using Human-Aware AI 2024

Meta-Inverse Reinforcement Learning for Mean Field Games via Probabilistic Context Variables 2024

Suppressing Uncertainty in Gaze Estimation 2024

Mask-Homo: Pseudo Plane Mask-Guided Unsupervised Multi-Homography Estimation 2024

Heterogeneous Test-Time Training for Multi-Modal Person Re-identification 2024