From Templates to Natural Language: Generalization Challenges in Instruction-Tuned LLMs for Spatial Reasoning

Chalamalasetti Kranti; Sherzod Hakimov; David Schlangen

AACL 2025

Abstract

Instruction-tuned large language models (LLMs) have shown strong performance on a variety of tasks; however, generalizing from synthetic to human-authored instructions in grounded environments remains a challenge for them. In this work, we study generalization challenges in spatial grounding tasks where models interpret and translate instructions for building object arrangements on a 2.5D grid. We fine-tune LLMs using only synthetic instructions and evaluate their performance on a benchmark dataset containing both synthetic and human-authored instructions. Our results reveal that while models generalize well on simple tasks, their performance degrades significantly on more complex tasks. We present a detailed error analysis of the gaps in instruction generalization.

Authors

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Learning Paradigms > Transfer Learning
Natural Language Processing > Applications > Intent Classification

Keywords

spatial reasoning, synthetic instruction, grounded environment, instruction-tuned large language model, instruction interpretation
