Paper Title

Physics-Informed Model-Based Reinforcement Learning

Paper Authors

Adithya Ramesh, Balaraman Ravindran

Paper Abstract

We apply reinforcement learning (RL) to robotics tasks. One of the drawbacks of traditional RL algorithms has been their poor sample efficiency. One approach to improve the sample efficiency is model-based RL. In our model-based RL algorithm, we learn a model of the environment, essentially its transition dynamics and reward function, use it to generate imaginary trajectories and backpropagate through them to update the policy, exploiting the differentiability of the model. Intuitively, learning more accurate models should lead to better model-based RL performance. Recently, there has been growing interest in developing better deep neural network based dynamics models for physical systems, by utilizing the structure of the underlying physics. We focus on robotic systems undergoing rigid body motion without contacts. We compare two versions of our model-based RL algorithm, one which uses a standard deep neural network based dynamics model and the other which uses a much more accurate, physics-informed neural network based dynamics model. We show that, in model-based RL, model accuracy mainly matters in environments that are sensitive to initial conditions, where numerical errors accumulate fast. In these environments, the physics-informed version of our algorithm achieves significantly better average-return and sample efficiency. In environments that are not sensitive to initial conditions, both versions of our algorithm achieve similar average-return, while the physics-informed version achieves better sample efficiency. We also show that, in challenging environments, physics-informed model-based RL achieves better average-return than state-of-the-art model-free RL algorithms such as Soft Actor-Critic, as it computes the policy-gradient analytically, while the latter estimates it through sampling.
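
As a concrete illustration of the policy update described in the abstract, below is a minimal sketch (in PyTorch) of backpropagating through imagined rollouts of a learned, differentiable dynamics model to obtain an analytic policy gradient. All module names, dimensions, and hyperparameters are illustrative assumptions, not the authors' implementation; in the physics-informed variant, the standard dynamics network would be replaced by a model that encodes the underlying rigid-body mechanics.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for a small rigid-body system without contacts.
STATE_DIM, ACTION_DIM, HORIZON = 4, 1, 15

class DynamicsModel(nn.Module):
    """Learned, differentiable model of the transition dynamics and reward."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(),
            nn.Linear(64, STATE_DIM + 1),  # next-state residual + predicted reward
        )

    def forward(self, state, action):
        out = self.net(torch.cat([state, action], dim=-1))
        next_state = state + out[..., :STATE_DIM]  # residual state prediction
        reward = out[..., STATE_DIM]
        return next_state, reward

class Policy(nn.Module):
    """Deterministic policy mapping states to bounded actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.Tanh(),
            nn.Linear(64, ACTION_DIM), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

def policy_update(model, policy, optimizer, start_states):
    """Roll out imaginary trajectories with the model and backpropagate the
    predicted return through them, i.e. an analytic policy gradient."""
    state, total_reward = start_states, 0.0
    for _ in range(HORIZON):
        action = policy(state)
        state, reward = model(state, action)  # gradients flow through the model
        total_reward = total_reward + reward.mean()
    loss = -total_reward                      # maximize predicted return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()

if __name__ == "__main__":
    model, policy = DynamicsModel(), Policy()
    # In practice the model is fit to real transitions and held fixed here.
    for p in model.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
    start_states = torch.randn(32, STATE_DIM)  # e.g., states sampled from a replay buffer
    print(policy_update(model, policy, optimizer, start_states))
```

This contrasts with a model-free method such as Soft Actor-Critic, which would estimate the policy gradient from sampled environment interactions instead of differentiating through a model.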
