TheoremQA: A Theorem-driven Question Answering Dataset

Wenhu Chen; Ming Yin; Max Ku; Pan Lu; Yixin Wan; Xueguang Ma; Jianyu Xu; Xinyi Wang; Tony Xia

2023 EMNLP EMNLP 2023

TheoremQA: A Theorem-driven Question Answering Dataset

Abstract

AbstractThe recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e. theorem) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models’ capabilities to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts containing 800 high-quality questions covering 350 theorems from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. We found that GPT-4’s capabilities to solve these problems are unparalleled, achieving an accuracy of 51% with Program-of-Thoughts Prompting. All the existing open-sourced models are below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can be used as a better benchmark to evaluate LLMs’ capabilities to solve challenging science problems.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing

🧭 Keyword Pioneer — program-of-thoughts prompting

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Wenhu Chen , Ming Yin , Max Ku , Pan Lu , Yixin Wan , Xueguang Ma , Jianyu Xu , Xinyi Wang , Tony Xia

Topics

Artificial Intelligence > Core AI > Foundation Models Natural Language Processing > Generation > Language Modeling Natural Language Processing > Applications > Question Answering

Keywords

theorem proving question answering chain-of-thought prompting large language model program-of-thoughts prompting

Download PDF

Related papers

Exploring Linguistic Probes for Morphological Generalization 2023

NameGuess: Column Name Expansion for Tabular Data 2023

Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning 2023

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation 2023

On the Calibration of Large Language Models and Alignment 2023