VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft

Honghao Fu; Junlong Ren; Qi Chai; Deheng Ye; Yujun Cai; Hao Wang

EMNLP 2025

Abstract

Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.
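
The abstract describes two ideas that a small example can make concrete: merging textual dependencies with detector output into one graph, and pooling only the task-related part of that graph. The following is a minimal sketch under stated assumptions, not the authors' implementation. The item names, the `TEXT_EDGES` table, the bounding-box format, and the functions `build_graph` and `pool_for_task` are all hypothetical, and the breadth-first hop limit is a simple stand-in for the paper's retrieval-based pooling strategy, whose details the abstract does not give.

```python
from collections import deque

# Hypothetical textual dependencies: item -> items needed to obtain it.
TEXT_EDGES = {
    "wooden_pickaxe": ["planks", "stick", "crafting_table"],
    "crafting_table": ["planks"],
    "stick": ["planks"],
    "planks": ["log"],
    "log": [],
    "iron_ingot": ["iron_ore", "furnace"],
}


def build_graph(text_edges, detections):
    """Merge textual edges and (label, box) detections into one graph dict."""
    graph = {
        name: {"requires": list(deps), "seen_at": []}
        for name, deps in text_edges.items()
    }
    for label, box in detections:
        # Detected objects with no textual entry still become nodes.
        node = graph.setdefault(label, {"requires": [], "seen_at": []})
        node["seen_at"].append(box)
    return graph


def pool_for_task(graph, target, max_hops=3):
    """Breadth-first retrieval of nodes within max_hops of the target item."""
    pooled = {}
    queue = deque([(target, 0)])
    seen = {target}
    while queue:
        name, hops = queue.popleft()
        node = graph.get(name)
        if node is None:
            continue  # Dependency mentioned in text but absent from the graph.
        pooled[name] = node
        if hops < max_hops:
            for dep in node["requires"]:
                if dep not in seen:
                    seen.add(dep)
                    queue.append((dep, hops + 1))
    return pooled


# Assumed detector output: (label, (x1, y1, x2, y2)) in screen pixels.
detections = [("log", (120, 80, 200, 260)), ("cow", (400, 300, 480, 360))]
graph = build_graph(TEXT_EDGES, detections)
pooled = pool_for_task(graph, "wooden_pickaxe")
```

In this sketch, pooling for `wooden_pickaxe` keeps the pickaxe, its ingredients, and the visible `log`, while unrelated nodes such as `cow` and `iron_ingot` are left out of the context that would be passed to the LLM.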

Topics

Artificial Intelligence > Core AI > Agent Systems
Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Efficient Computing
Knowledge & Reasoning > Representation > Knowledge Graphs
Artificial Intelligence > Core AI > Knowledge Graph
Computer Vision > Domain-Specific > Robotics

Keywords

object detection; embodied AI; domain knowledge; retrieval-augmented generation; agent system; domain-specific knowledge; agent planning; embodied decision-making; cross-modal knowledge graph; retrieval-based pooling; cost-effective training
