Penalized Proximal Policy Optimization for Safe Reinforcement Learning

Paper Title

Penalized Proximal Policy Optimization for Safe Reinforcement Learning

Authors

Linrui Zhang, Li Shen, Long Yang, Shixiang Chen, Bo Yuan, Xueqian Wang, Dacheng Tao

Abstract

Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle to perform efficient policy updates while strictly satisfying the constraints. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O uses a simple yet effective penalty function to eliminate the cost constraints and replaces the trust-region constraint with the clipped surrogate objective. We theoretically prove the exactness of the proposed method with a finite penalty factor and provide a worst-case analysis of the approximation error when evaluated on sample trajectories. Moreover, we extend P3O to more challenging multi-constraint and multi-agent scenarios, which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms in both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
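
To make the idea in the abstract concrete, the following is a minimal sketch of a penalized, clipped objective. It is not the authors' exact formulation. It assumes a ReLU-type penalty `kappa * max(0, ·)` on the cost term and a PPO-style clip range `eps`. The names `penalized_loss`, `kappa`, `eps`, and `cost_violation` (the estimated expected cost minus the cost limit) are hypothetical and chosen here for illustration only.

```python
import numpy as np


def penalized_loss(ratio, adv_reward, adv_cost, cost_violation,
                   kappa=20.0, eps=0.2):
    """Unconstrained loss to minimize (illustrative sketch, not the paper's code).

    ratio          : new/old policy probability ratios, shape (N,)
    adv_reward     : reward advantage estimates, shape (N,)
    adv_cost       : cost advantage estimates, shape (N,)
    cost_violation : assumed scalar, estimated cost minus the cost limit
    kappa          : assumed finite penalty factor
    eps            : assumed clip range for the surrogate
    """
    ratio = np.asarray(ratio, dtype=float)
    adv_reward = np.asarray(adv_reward, dtype=float)
    adv_cost = np.asarray(adv_cost, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)

    # Reward term: negative of the pessimistic (min) clipped surrogate.
    reward_loss = -np.mean(np.minimum(ratio * adv_reward, clipped * adv_reward))

    # Cost term: pessimistic (max) clipped surrogate of the cost change,
    # shifted by how far the current policy is from the cost limit.
    cost_surrogate = np.mean(np.maximum(ratio * adv_cost, clipped * adv_cost))

    # The penalty is active only when the predicted cost exceeds the limit.
    penalty = kappa * max(0.0, cost_surrogate + cost_violation)
    return reward_loss + penalty
```

When the predicted cost stays below the limit, the penalty term vanishes and the loss reduces to the usual clipped reward surrogate. Otherwise the cost term is weighted by `kappa`, so a single minimization handles both objectives.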
