Physical Derivatives: Computing policy gradients by physical forward-propagation

Paper title

Physical Derivatives: Computing policy gradients by physical forward-propagation

Authors

Arash Mehrjou, Ashkan Soleymani, Stefan Bauer, Bernhard Schölkopf

Abstract

Model-free and model-based reinforcement learning are two ends of a spectrum. Learning a good policy without a dynamic model can be prohibitively expensive. Learning the dynamic model of a system can reduce the cost of learning the policy, but it can also introduce bias if it is not accurate. We propose a middle ground where instead of the transition model, the sensitivity of the trajectories with respect to the perturbation of the parameters is learned. This allows us to predict the local behavior of the physical system around a set of nominal policies without knowing the actual model. We assay our method on a custom-built physical robot in extensive experiments and show the feasibility of the approach in practice. We investigate potential challenges when applying our method to physical systems and propose solutions to each of them.
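
The idea of predicting local behavior from trajectory sensitivities can be illustrated with a minimal sketch. It is not the authors' method: it swaps in a plain central finite difference for the learned sensitivity estimator, and the toy linear plant, the policy u = -theta * x, and all function names (rollout, trajectory_sensitivity, predict_trajectory) are assumptions made up for this illustration. The estimator only calls rollout and never reads the plant constants, standing in for running the physical system with slightly perturbed policy parameters.

```python
# Toy stand-in for the physical system. The dynamics are an assumption made
# only for this illustration; the estimator below treats rollout as a black box.
def rollout(theta, x0=1.0, steps=20):
    """Run the closed loop under policy u = -theta * x and return the states."""
    a, b = 0.9, 0.5  # hypothetical plant constants
    xs = [x0]
    for _ in range(steps):
        u = -theta * xs[-1]
        xs.append(a * xs[-1] + b * u)
    return xs


def trajectory_sensitivity(theta, eps=1e-3):
    """Central finite-difference estimate of d x_t / d theta at every time step."""
    plus = rollout(theta + eps)
    minus = rollout(theta - eps)
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


def predict_trajectory(theta_nominal, delta):
    """First-order prediction of the trajectory at theta_nominal + delta."""
    nominal = rollout(theta_nominal)
    sens = trajectory_sensitivity(theta_nominal)
    return [x + s * delta for x, s in zip(nominal, sens)]


# Compare the first-order prediction with an actual rollout at the new parameter.
predicted = predict_trajectory(0.4, 0.05)
actual = rollout(0.45)
max_error = max(abs(p - q) for p, q in zip(predicted, actual))
print(f"max prediction error: {max_error:.5f}")
```

The prediction is only trustworthy for small parameter changes around the nominal policy, since it drops second-order and higher terms.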
