Vision Language Navigation with Knowledge-driven Environmental Dreamer

Vision Language Navigation with Knowledge-driven Environmental Dreamer

Fengda Zhu, Vincent CS Lee, Xiaojun Chang, Xiaodan Liang

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 1840-1848. https://doi.org/10.24963/ijcai.2023/204

Vision-language navigation (VLN) requires an agent to perceive visual observation in a house scene and navigate step-by-step following natural language instruction. Due to the high cost of data annotation and data collection, current VLN datasets provide limited instruction-trajectory data samples. Learning vision-language alignment for VLN from limited data is challenging since visual observation and language instruction are both complex and diverse. Previous works only generate augmented data based on original scenes while failing to generate data samples from unseen scenes, which limits the generalization ability of the navigation agent. In this paper, we introduce the Knowledge-driven Environmental Dreamer (KED), a method that leverages the knowledge of the embodied environment and generates unseen scenes for a navigation agent to learn. Generating an unseen environment with texture consistency and structure consistency is challenging. To address this problem, we incorporate three knowledge-driven regularization objectives into the KED and adopt a reweighting mechanism for self-adaptive optimization. Our KED method is able to generate unseen embodied environments without extra annotations. We use KED to successfully generate 270 houses and 500K instruction-trajectory pairs. The navigation agent with the KED method outperforms the state-of-the-art methods on various VLN benchmarks, such as R2R, R4R, and RxR. Both qualitative and quantitative experiments prove that our proposed KED method is able to high-quality augmentation data with texture consistency and structure consistency.
Keywords:
Computer Vision: CV: Vision and languageĀ 
Robotics: ROB: Robotics and vision