TM-PATHVQA: 90000+ Textless Multilingual Questions for Medical Visual Question Answering

Tonmoy Rajkhowa; Amartya Roy Chowdhury; Sankalp Nagaonkar; Achyut Mani Tripathi; Mahadeva Prasanna

INTERSPEECH 2024

Abstract

In healthcare and medical diagnostics, Visual Question Answering (VQA) may emerge as a pivotal tool in scenarios where analysis of intricate medical images is critical for accurate diagnoses. Current text-based VQA systems are of limited use where hands-free interaction and accessibility are crucial. A speech-based VQA system may provide a better means of interaction, allowing information to be accessed while other tasks are performed. To this end, this work implements a speech-based VQA system by introducing the Textless Multilingual Pathological VQA (TM-PathVQA) dataset, an expansion of the PathVQA dataset that contains spoken questions in English, German, and French. The dataset comprises 98,397 multilingual spoken questions and answers based on 5,004 pathological images, along with 70 hours of audio. Finally, this work benchmarks and compares TM-PathVQA systems implemented using various combinations of acoustic and visual features.
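To put the figures quoted in the abstract in proportion, the following is a minimal arithmetic sketch, not part of the paper. It assumes the questions are split evenly across the three languages and that the 70 hours of audio cover exactly the 98,397 spoken items; neither assumption is confirmed by the abstract. The function name `dataset_summary` and the dictionary keys are hypothetical.

```python
# Figures quoted in the abstract.
NUM_SPOKEN_QUESTIONS = 98_397
NUM_IMAGES = 5_004
AUDIO_HOURS = 70
LANGUAGES = ("English", "German", "French")


def dataset_summary(
    num_questions: int = NUM_SPOKEN_QUESTIONS,
    num_images: int = NUM_IMAGES,
    audio_hours: float = AUDIO_HOURS,
    num_languages: int = len(LANGUAGES),
) -> dict:
    """Derive rough per-language, per-image, and per-question averages.

    Assumption (not stated in the abstract): questions are split evenly
    across languages, and all audio belongs to the counted questions.
    """
    return {
        # Average number of spoken questions per language.
        "questions_per_language": num_questions / num_languages,
        # Average number of spoken questions (all languages) per image.
        "questions_per_image": num_questions / num_images,
        # Average audio duration per spoken question, in seconds.
        "seconds_per_question": audio_hours * 3600 / num_questions,
    }


if __name__ == "__main__":
    for name, value in dataset_summary().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, the figures work out to 32,799 questions per language, roughly 19.7 spoken questions per image, and about 2.6 seconds of audio per question.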

Topics

Computer Vision > Domain-Specific > Medical Imaging

Natural Language Processing > Applications > Question Answering

Keywords

visual question answering, medical imaging, speech recognition
