CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering

Shaowei Wang; Lingling Zhang; Longji Zhu; Tao Qin; Kim-Hui Yap; Xinyu Zhang; Jun Liu

2024 CVPR CVPR 2024

CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering

Abstract

Diagram Question Answering (DQA) is a challenging task requiring models to answer natural language questions based on visual diagram contexts. It serves as a crucial basis for academic tutoring technical support and more practical applications. DQA poses significant challenges such as the demand for domain-specific knowledge and the scarcity of annotated data which restrict the applicability of large-scale deep models. Previous approaches have explored external knowledge integration through pre-training but these methods are costly and can be limited by domain disparities. While Large Language Models (LLMs) show promise in question-answering there is still a gap in how to cooperate and interact with the diagram parsing process. In this paper we introduce the Chain-of-Guiding Learning Model for Diagram Question Answering (CoG-DQA) a novel framework that effectively addresses DQA challenges. CoG-DQA leverages LLMs to guide diagram parsing tools (DPTs) through the guiding chains enhancing the precision of diagram parsing while introducing rich background knowledge. Our experimental findings reveal that CoG-DQA surpasses all comparison models in various DQA scenarios achieving an average accuracy enhancement exceeding 5% and peaking at 11% across four datasets. These results underscore CoG-DQA's capacity to advance the field of visual question answering and promote the integration of LLMs into specialized domains.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing

🧭 Keyword Pioneer — diagram question answering

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Shaowei Wang , Lingling Zhang , Longji Zhu , Tao Qin , Kim-Hui Yap , Xinyu Zhang , Jun Liu

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Applications > Question Answering Natural Language Processing > Resources & Methods > Large Language Models

Keywords

visual question answering large language model diagram question answering chain-of-guiding learning diagram parsing tool

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024