Mind the Query: A Benchmark Dataset towards Text2Cypher Task

Vashu Chauhan; Shobhit Raj; Shashank Mujumdar; Avirup Saha; Anannay Jain

EMNLP 2025

Abstract

We present a high-quality, multi-domain dataset for the Text2Cypher task, which translates natural language (NL) questions into executable Cypher queries over graph databases. The dataset comprises 27,529 NL queries and their corresponding Cypher queries, spanning 11 real-world graph datasets, each accompanied by its graph database for grounded query execution. To ensure correctness, the queries are validated through a rigorous pipeline that combines automated schema, runtime, and value checks with manual review for logical correctness. Queries are further categorized by complexity to support fine-grained evaluation. We have released our benchmark dataset and the code to replicate our data synthesis pipeline on new graph datasets, supporting extensibility and future research on Text2Cypher.
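To make the idea of an automated schema check concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the schema, the regular expressions, and the function name `schema_errors` are all assumptions made for illustration. It only checks that every node label and relationship type mentioned in a Cypher query exists in a known schema; the runtime and value checks mentioned in the abstract would need a live graph database and are not covered.

```python
import re

# Hypothetical schema for illustration only; not taken from the paper's datasets.
SCHEMA = {
    "labels": {"Person", "Movie"},
    "relationships": {"ACTED_IN", "DIRECTED"},
}

# Matches node labels such as (p:Person) or (:Movie {title: 'X'}).
LABEL_RE = re.compile(r"\(\s*\w*\s*:\s*(\w+)")
# Matches relationship types such as -[:ACTED_IN]-> or -[r:DIRECTED]-.
REL_RE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")


def schema_errors(cypher, schema):
    """Return a list of labels and relationship types missing from the schema."""
    errors = []
    for label in LABEL_RE.findall(cypher):
        if label not in schema["labels"]:
            errors.append(f"unknown label: {label}")
    for rel in REL_RE.findall(cypher):
        if rel not in schema["relationships"]:
            errors.append(f"unknown relationship: {rel}")
    return errors


good = "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name"
bad = "MATCH (p:Person)-[:LIKES]->(s:Song) RETURN s"
print(schema_errors(good, SCHEMA))
print(schema_errors(bad, SCHEMA))
```

A regex-based check like this is deliberately simple and would miss multi-label nodes, alternative relationship types, and property names; a real validator would more likely rely on a Cypher parser or on the database's own query planner.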

Topics

Natural Language Processing > Applications > Information Retrieval; Natural Language Processing > Applications > Question Answering; Natural Language Processing > Resources & Methods > Text Representation; Knowledge & Reasoning > Representation > Knowledge Graphs; Machine Learning > Learning Types > Evaluation; Natural Language Processing > Applications > Semantic Parsing; Computer Science > Applications > Databases

Keywords

semantic parsing; natural language queries; knowledge graph; benchmark dataset; query language; graph database; query translation; natural language interface; cypher query; natural language query
