COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation

Siqi Zhang; Yanyuan Qiao; Qunbo Wang; Zike Yan; Qi Wu; Zhihua Wei; Jing Liu

ICCV 2025

Abstract

Vision-and-Language Navigation (VLN) has gained prominence in artificial intelligence research because of its potential applications in areas such as home assistants. Many contemporary VLN approaches, although built on transformer architectures, increasingly incorporate additional components such as external knowledge bases or map information. These additions boost performance but also lead to larger models and higher computational costs. To achieve both high performance and low computational cost, we propose a novel architecture, the **co**mbination of **s**elective **m**em**o**rization (COSMO), which integrates state-space modules (SSMs) and transformer modules. Because applying SSMs directly to VLN causes significant performance degradation, we propose two VLN-customized selective state space modules: Round Selective Scan (RSS) and the Cross-modal Selective State Space Module (CS3). RSS facilitates comprehensive inter-modal interactions within a single scan, while CS3 adapts the selective state space module to a dual-stream architecture, thereby enhancing the acquisition of cross-modal interactions. Experiments on three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, demonstrate competitive navigation performance together with a significant reduction in computational cost. Code is available at https://github.com/siqiZ805/VLN-COSMO.git.
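
For readers unfamiliar with selective state-space modules, the following is a minimal sketch of a generic selective scan, in which the step size and the input/output projections depend on the current token. It is not the paper's RSS or CS3 module, whose details the abstract does not give. The function name `selective_scan`, the parameter names, and all shapes are assumptions chosen for illustration.

```python
import numpy as np

def selective_scan(x, A, W_B, W_C, W_delta):
    """Generic selective state-space scan (illustrative, not COSMO's RSS/CS3).

    x:       (T, D) token sequence
    A:       (D, N) negative decay parameters of the state
    W_B/W_C: (D, N) projections giving input-dependent B_t and C_t
    W_delta: (D, D) projection giving the input-dependent step size
    """
    T, D = x.shape
    h = np.zeros((D, A.shape[1]))                  # hidden state: N values per channel
    y = np.empty((T, D))
    for t in range(T):
        # Softplus keeps the per-channel step size positive.
        delta = np.log1p(np.exp(x[t] @ W_delta))   # (D,)
        B_t = x[t] @ W_B                           # (N,)
        C_t = x[t] @ W_C                           # (N,)
        A_bar = np.exp(delta[:, None] * A)         # (D, N) discretized decay in (0, 1)
        B_bar = delta[:, None] * B_t[None, :]      # (D, N) discretized input gate
        h = A_bar * h + B_bar * x[t][:, None]      # selectively forget and write
        y[t] = h @ C_t                             # (D,) readout
    return y

# Hypothetical toy sizes: 6 tokens, 4 channels, state size 3.
rng = np.random.default_rng(0)
T, D, N = 6, 4, 3
x = rng.normal(size=(T, D))
A = -np.exp(rng.normal(size=(D, N)))               # negative values give a decaying state
W_B = rng.normal(size=(D, N))
W_C = rng.normal(size=(D, N))
W_delta = rng.normal(size=(D, D))
y = selective_scan(x, A, W_B, W_C, W_delta)
```

The loop visits each token once and carries a fixed-size state, so the cost grows linearly with sequence length, in contrast to the quadratic cost of full self-attention. This is the general property that makes such modules attractive for low-cost models.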

Topics

Artificial Intelligence > Core AI > Agent Systems
Artificial Intelligence > Core AI > Planning
Deep Learning > Architectures > Transformers

Keywords

state-space model, computational cost, cross-modal interaction, vision-and-language navigation
