Paper Title
Diagnosing the Environment Bias in Vision-and-Language Navigation
Paper Authors
Paper Abstract
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations. These step-by-step navigational instructions are crucial when the agent is navigating new environments about which it has no prior knowledge. Recent works studying VLN observe a significant performance drop when agents are tested on unseen environments (i.e., environments not used in training), indicating that the neural agent models are highly biased towards the training environments. Although this issue is considered one of the major challenges in VLN research, it remains under-studied and needs a clearer explanation. In this work, we design novel diagnosis experiments via environment re-splitting and feature replacement to investigate possible reasons for this environment bias. We observe that it is not the language or the underlying navigational graph, but the low-level visual appearance conveyed by ResNet features, that directly affects the agent model and contributes to this environment bias. Based on this observation, we explore several kinds of semantic representations that contain less low-level visual information, so that an agent trained with these features generalizes better to unseen testing environments. Without modifying the baseline agent model or its training method, our explored semantic features significantly decrease the performance gap between seen and unseen environments on multiple datasets (i.e., R2R, R4R, and CVDN) and achieve unseen results competitive with previous state-of-the-art models. Our code and features are available at: https://github.com/zhangybzbo/EnvBiasVLN
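To make the feature-replacement diagnosis concrete, the sketch below illustrates how precomputed visual features fed to a fixed VLN agent could be swapped between low-level ResNet features and more semantic features. It is a minimal sketch, not the authors' released code: the directory layout, file naming, 36-view panorama convention, and the `FeatureProvider` class are all assumptions made for illustration.

```python
# Minimal sketch of the feature-replacement diagnosis idea (illustrative, not the authors' code).
# Assumptions: per-viewpoint features are precomputed and stored as .npy files named
# "{scan_id}_{viewpoint_id}.npy", with 36 discretized views per panorama, and the agent
# consumes a [36, feat_dim] array per viewpoint regardless of which feature type is used.
import numpy as np


def load_viewpoint_features(feature_dir: str, scan_id: str, viewpoint_id: str) -> np.ndarray:
    """Load one panorama's view features from a precomputed store (hypothetical layout)."""
    path = f"{feature_dir}/{scan_id}_{viewpoint_id}.npy"
    feats = np.load(path)  # expected shape: [36, feat_dim]
    return feats.astype(np.float32)


class FeatureProvider:
    """Swap the visual representation fed to the agent without touching the agent itself."""

    def __init__(self, feature_type: str = "resnet"):
        # "resnet": low-level appearance features (e.g., 2048-d ResNet pooled features).
        # "semantic": features carrying less low-level appearance, e.g., derived from
        #             semantic segmentation (assumed to be precomputed the same way).
        roots = {
            "resnet": "img_features/resnet",
            "semantic": "img_features/semantic",
        }
        self.root = roots[feature_type]

    def __call__(self, scan_id: str, viewpoint_id: str) -> np.ndarray:
        return load_viewpoint_features(self.root, scan_id, viewpoint_id)
```

Under this setup, the same baseline agent would be trained once per provider and the seen-versus-unseen success-rate gap compared; a smaller gap with the semantic provider would indicate that the low-level visual appearance, rather than the language or the navigational graph, is the source of the environment bias.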