Astra: Efficient Transformer Architecture and Contrastive Dynamics Learning for Embodied Instruction Following

YueEn Ma; Dafeng Chi; Shiguang Wu; Yuecheng Liu; Yuzheng Zhuang; Irwin King

EMNLP 2025

Abstract

Vision-language-action models have gained significant attention for their ability to model multimodal sequences in embodied instruction following tasks. However, most existing models rely on causal attention, which we find suboptimal for processing sequences composed of interleaved segments from different modalities. In this paper, we introduce Astra, a novel Transformer architecture featuring trajectory attention and learnable action queries, designed to efficiently process segmented multimodal trajectories and predict actions for imitation learning. Furthermore, we propose a contrastive dynamics learning objective to enhance the model’s understanding of environment dynamics and multimodal alignment, complementing the primary behavior cloning objective. Through extensive experiments on three large-scale robot manipulation benchmarks, Astra demonstrates substantial performance improvements over previous models.
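
The abstract does not spell out the attention pattern, so the following is only an illustrative sketch of one way attention over a segmented trajectory could differ from plain causal attention. It assumes (this is an assumption, not a detail taken from the abstract) that tokens in the same segment see each other in both directions while later segments stay hidden. The function names and the toy `segment_ids` are hypothetical.

```python
from typing import List


def segment_block_mask(segment_ids: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption for illustration: a token sees every token in its own
    segment (bidirectional) and every token in earlier segments.
    Expects segment_ids to be non-decreasing along the sequence.
    """
    return [[seg_j <= seg_i for seg_j in segment_ids] for seg_i in segment_ids]


def causal_mask(n: int) -> List[List[bool]]:
    """Standard token-level causal mask: token i sees tokens 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Hypothetical trajectory: 2 instruction tokens, 2 image tokens, 1 action token.
segment_ids = [0, 0, 1, 1, 2]
block = segment_block_mask(segment_ids)
causal = causal_mask(len(segment_ids))

for name, mask in (("causal", causal), ("segment-block", block)):
    print(name)
    for row in mask:
        print(" ".join("1" if allowed else "." for allowed in row))
```

Under the causal mask the first token of a segment cannot see the rest of its own segment; under the segment-level mask it can, which is the kind of difference the abstract's criticism of causal attention points at.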

Topics

Artificial Intelligence > Core AI > Agent Systems
Machine Learning > Learning Types > Contrastive Learning
Deep Learning > Architectures > Transformers

Keywords

transformer architecture, imitation learning, embodied instruction following, trajectory attention, contrastive dynamics learning
