Paper Title

Off-Policy Deep Reinforcement Learning with Analogous Disentangled Exploration

Paper Authors

Anji Liu, Yitao Liang, Guy Van den Broeck

Paper Abstract

Off-policy reinforcement learning (RL) is concerned with learning a rewarding policy by executing another policy that gathers samples of experience. While the former policy (i.e. the target policy) is rewarding but inexpressive (in most cases, deterministic), doing well in the latter task, in contrast, requires an expressive policy (i.e. the behavior policy) that offers guided and effective exploration. Contrary to most methods that make a trade-off between optimality and expressiveness, disentangled frameworks explicitly decouple the two objectives, each of which is handled by a distinct, separate policy. Although this allows the two policies to be freely designed and optimized with respect to their own objectives, naively disentangling them can lead to inefficient learning or stability issues. To mitigate this problem, our proposed method, Analogous Disentangled Actor-Critic (ADAC), designs analogous pairs of actors and critics. Specifically, ADAC leverages a key property of Stein variational gradient descent (SVGD) to constrain the expressive energy-based behavior policy with respect to the target one for effective exploration. Additionally, an analogous critic pair is introduced to incorporate intrinsic rewards in a principled manner, with theoretical guarantees on the overall learning stability and effectiveness. We empirically evaluate environment-reward-only ADAC on 14 continuous-control tasks and report state-of-the-art performance on 10 of them. We further demonstrate that ADAC, when paired with intrinsic rewards, outperforms alternatives in exploration-challenging tasks.
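For background on the SVGD machinery the abstract refers to (this is the standard update of Liu & Wang, 2016, quoted for context; it is not ADAC's specific construction): SVGD maintains a set of particles $\{x_i\}_{i=1}^n$ and transports them toward a target distribution $p$ via $x_i \leftarrow x_i + \epsilon\,\hat{\phi}^*(x_i)$, where

$$\hat{\phi}^*(x) = \frac{1}{n}\sum_{j=1}^{n}\Big[\, k(x_j, x)\,\nabla_{x_j}\log p(x_j) + \nabla_{x_j} k(x_j, x) \,\Big]$$

and $k(\cdot,\cdot)$ is a positive-definite kernel (e.g. an RBF kernel). The first term drives particles toward high-density regions of $p$; the second acts as a repulsive force that keeps the particle set diverse. This combination is what lets an SVGD-derived behavior policy remain expressive for exploration while staying anchored to the distribution induced by the target policy.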
