Paper Title
Proximal Policy Optimization with Relative Pearson Divergence
Paper Authors
Paper Abstract
The recent remarkable progress of deep reinforcement learning (DRL) rests on the regularization of the policy for stable and efficient learning. A popular method, proximal policy optimization (PPO), has been introduced for this purpose. PPO clips the density ratio between the latest and baseline policies at a threshold, but the target of this minimization is unclear. A further problem of PPO is that the threshold is given as a symmetric numerical interval even though the density ratio itself lies in an asymmetric domain, which causes unbalanced regularization of the policy. This paper therefore proposes a new variant of PPO, called PPO-RPE, which formulates the regularization as a minimization problem of the relative Pearson (RPE) divergence. This regularization yields a clear minimization target that constrains the latest policy toward the baseline one. From its analysis, an intuitive design of an asymmetric threshold, consistent with the asymmetric domain of the density ratio, can be derived. On four benchmark tasks, PPO-RPE performed as well as or better than the conventional methods in terms of the task performance of the learned policy.
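For context, the two objects referred to in the abstract can be written in their standard forms. The following is a minimal sketch assuming the textbook definitions of PPO's clipped surrogate and of the relative Pearson divergence; it is not the exact PPO-RPE loss, which is defined in the paper itself. Here $\pi_\theta$ denotes the latest policy, $\pi_b$ the baseline policy, $\hat{A}_t$ an advantage estimate, and $\epsilon$ and $\beta$ are hyperparameters (the symbols $\pi_b$, $\epsilon$, $\beta$ are notational assumptions, not taken from the abstract).

\[
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right]
\]

\[
\mathrm{PE}_\beta(p \,\|\, q) = \frac{1}{2}\,\mathbb{E}_{x \sim q_\beta}\!\left[\left(\frac{p(x)}{q_\beta(x)} - 1\right)^{\!2}\right],
\qquad q_\beta = \beta\,p + (1-\beta)\,q, \quad \beta \in [0,1)
\]

The clipped ratio is bounded in the symmetric interval $[1-\epsilon,\,1+\epsilon]$ even though $r_t(\theta)$ ranges over $(0,\infty)$; this is the imbalance the abstract points to. A relative-Pearson-divergence penalty, by contrast, provides an explicit divergence that is minimized exactly when $\pi_\theta$ coincides with the baseline policy.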