Paper Title
Robust Learning from Observation with Model Misspecification
Paper Authors
Paper Abstract
Imitation learning (IL) is a popular paradigm for training policies in robotic systems when specifying the reward function is difficult. However, despite the success of IL algorithms, they impose the somewhat unrealistic requirement that the expert demonstrations must come from the same domain in which a new imitator policy is to be learned. We consider a practical setting where (i) state-only expert demonstrations from the real (deployment) environment are given to the learner, (ii) the imitation learner policy is trained in a simulation (training) environment whose transition dynamics differ slightly from those of the real environment, and (iii) the learner has no access to the real environment during the training phase beyond the given batch of demonstrations. Most current IL methods, such as generative adversarial imitation learning and its state-only variants, fail to imitate the optimal expert behavior under this setting. By leveraging insights from the robust reinforcement learning (RL) literature and building on recent adversarial imitation approaches, we propose a robust IL algorithm that learns policies which transfer effectively to the real environment without fine-tuning. Furthermore, we demonstrate empirically on continuous-control benchmarks that our method outperforms the state-of-the-art state-only IL method in terms of zero-shot transfer performance in the real environment and robustness under different testing conditions.
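To make the setting concrete, below is a minimal, hypothetical sketch of the state-only adversarial imitation signal that the abstract alludes to (in the style of GAIfO-like methods): a discriminator over state transitions (s, s') is trained to tell expert pairs from learner pairs, and its output is converted into a surrogate reward for the imitation learner. This is not the paper's algorithm; in particular, it omits the robust-RL component that handles the dynamics mismatch between the simulation and the real environment. All names, dimensions, and hyperparameters here are illustrative assumptions.

```python
# Hypothetical sketch of a state-only transition discriminator and surrogate
# imitation reward; not the paper's implementation.
import torch
import torch.nn as nn

STATE_DIM = 4  # placeholder state dimensionality


class TransitionDiscriminator(nn.Module):
    """D(s, s') in [0, 1]: probability the transition comes from the expert."""

    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))


def imitation_reward(disc, s, s_next, eps=1e-8):
    """Surrogate reward: high when the learner's (s, s') looks expert-like."""
    with torch.no_grad():
        d = disc(s, s_next)
    return -torch.log(1.0 - d + eps)


def discriminator_step(disc, optimizer, expert_s, expert_s_next,
                       learner_s, learner_s_next):
    """One binary-classification update: expert transitions = 1, learner = 0."""
    bce = nn.BCELoss()
    expert_pred = disc(expert_s, expert_s_next)
    learner_pred = disc(learner_s, learner_s_next)
    loss = (bce(expert_pred, torch.ones_like(expert_pred))
            + bce(learner_pred, torch.zeros_like(learner_pred)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    disc = TransitionDiscriminator(STATE_DIM)
    opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
    # Fake batches standing in for real-environment expert demonstrations and
    # simulation-environment learner rollouts.
    expert_s, expert_s_next = torch.randn(32, STATE_DIM), torch.randn(32, STATE_DIM)
    learner_s, learner_s_next = torch.randn(32, STATE_DIM), torch.randn(32, STATE_DIM)
    print("disc loss:", discriminator_step(disc, opt, expert_s, expert_s_next,
                                           learner_s, learner_s_next))
    print("reward sample:", imitation_reward(disc, learner_s, learner_s_next)[:3].squeeze())
```

In the robust variant described by the abstract, the learner optimizing this kind of surrogate reward would additionally be trained to withstand perturbed simulation dynamics (in the spirit of robust RL), so that the policy transfers zero-shot to the real environment whose dynamics differ from the simulator's.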