Paper Title

You May Not Need Ratio Clipping in PPO

Paper Authors

Mingfei Sun, Vitaly Kurin, Guoqing Liu, Sam Devlin, Tao Qin, Katja Hofmann, Shimon Whiteson

Paper Abstract

Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data. Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples. Ratio clipping yields a pessimistic estimate of the original surrogate objective and has been shown to be crucial for strong performance. We show in this paper that such ratio clipping may not be a good option, as it can fail to effectively bound the ratios. Instead, one can directly optimize the original surrogate objective for multiple epochs; the key is to find a proper condition for stopping the optimization epochs early in each iteration. Our theoretical analysis sheds light on how to determine when to stop the optimization epochs, and we call the resulting algorithm Early Stopping Policy Optimization (ESPO). We compare ESPO with PPO across many continuous control tasks and show that ESPO significantly outperforms PPO. Furthermore, we show that ESPO can be easily scaled up to distributed training with many workers, delivering strong performance as well.
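To make the two objectives in the abstract concrete, here is a minimal PyTorch sketch contrasting the original (unclipped) surrogate, which ESPO optimizes directly, with the ratio-clipped PPO surrogate, plus an illustrative early-stopping check on the probability ratios. The function names, the mean-absolute-deviation criterion, and the 0.25 threshold are assumptions made for this sketch, not the exact stopping condition derived in the paper.

```python
import torch

def surrogate_objectives(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Return the original (unclipped) surrogate, which ESPO optimizes
    directly, and the ratio-clipped PPO surrogate for comparison."""
    ratios = torch.exp(log_prob_new - log_prob_old)
    unclipped = (ratios * advantages).mean()
    # Pessimistic PPO objective: elementwise min of unclipped and clipped terms.
    clipped = torch.min(
        ratios * advantages,
        torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    ).mean()
    return unclipped, clipped

def stop_epochs_early(log_prob_new, log_prob_old, max_ratio_dev=0.25):
    """Illustrative early-stopping check (assumption for this sketch):
    end the optimization epochs of the current iteration once the
    probability ratios drift too far from 1 on average."""
    ratios = torch.exp(log_prob_new - log_prob_old)
    return (ratios - 1.0).abs().mean().item() > max_ratio_dev

if __name__ == "__main__":
    # Synthetic batch, just to exercise the two functions.
    log_prob_old = torch.randn(64)
    log_prob_new = log_prob_old + 0.1 * torch.randn(64)
    advantages = torch.randn(64)
    unclipped, clipped = surrogate_objectives(log_prob_new, log_prob_old, advantages)
    print(unclipped.item(), clipped.item(),
          stop_epochs_early(log_prob_new, log_prob_old))
```

In an ESPO-style iteration, one would run the usual mini-batch epochs on the unclipped surrogate and break out of the epoch loop as soon as the stopping check fires, rather than relying on clipping to bound the ratios.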
