Paper Title

Distilling a Hierarchical Policy for Planning and Control via Representation and Reinforcement Learning

Authors

Ha, Jung-Su, Park, Young-Jin, Chae, Hyeok-Joo, Park, Soon-Seo, Choi, Han-Lim

Abstract

We present a hierarchical planning and control framework that enables an agent to perform various tasks and adapt to a new task flexibly. Rather than learning an individual policy for each particular task, the proposed framework, DISH, distills a hierarchical policy from a set of tasks by representation and reinforcement learning. The framework is based on the idea of latent variable models that represent high-dimensional observations using low-dimensional latent variables. The resulting policy consists of two levels of hierarchy: (i) a planning module that reasons a sequence of latent intentions that would lead to an optimistic future and (ii) a feedback control policy, shared across the tasks, that executes the inferred intention. Because the planning is performed in low-dimensional latent space, the learned policy can immediately be used to solve or adapt to new tasks without additional training. We demonstrate the proposed framework can learn compact representations (3- and 1-dimensional latent states and commands for a humanoid with 197- and 36-dimensional state features and actions) while solving a small number of imitation tasks, and the resulting policy is directly applicable to other types of tasks, i.e., navigation in cluttered environments. Video: https://youtu.be/HQsQysUWOhg
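To make the two-level structure concrete, here is a minimal sketch of the hierarchy the abstract describes: a planner that searches over sequences of low-dimensional latent intentions, and a shared feedback control policy that executes the current intention on the full state. All names (`encode`, `plan_intentions`, `control_policy`), the random-sampling planner, and the placeholder scoring are illustrative assumptions, not the paper's actual learned models; only the dimensions (3-D latent state, 1-D command, 197-D state, 36-D action) come from the abstract.

```python
import random

# Dimensions reported in the abstract for the humanoid experiments.
LATENT_DIM = 3    # latent state
COMMAND_DIM = 1   # latent command / intention
STATE_DIM = 197   # state features
ACTION_DIM = 36   # actions


def encode(state):
    """Stand-in for the learned latent variable model: map a
    high-dimensional observation to a low-dimensional latent state.
    Here it simply truncates, purely for illustration."""
    return state[:LATENT_DIM]


def plan_intentions(latent_state, horizon=5, candidates=8):
    """High-level planning module (sketch): sample candidate sequences of
    latent commands and keep the one with the best predicted outcome.
    The score is a random placeholder where a learned latent dynamics
    and value model would go."""
    best_seq, best_score = None, float("-inf")
    for _ in range(candidates):
        seq = [[random.uniform(-1.0, 1.0) for _ in range(COMMAND_DIM)]
               for _ in range(horizon)]
        score = random.random()  # placeholder predicted return
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq


def control_policy(state, intention):
    """Low-level feedback policy shared across tasks (sketch): maps the
    full state plus the current latent intention to an action vector."""
    gain = intention[0]
    return [gain * s for s in state[:ACTION_DIM]]


# One planning-and-control step for a dummy observation: because planning
# happens in the 3-D latent space, swapping the task only changes the
# planner's scoring, not the shared controller.
state = [0.1] * STATE_DIM
z = encode(state)
intentions = plan_intentions(z)
action = control_policy(state, intentions[0])
print(len(z), len(intentions), len(action))  # 3 5 36
```

The design point the sketch mirrors is that new tasks require no retraining: only the planner's objective changes, while `control_policy` is reused as-is.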