Paper Title

Optimistic Curiosity Exploration and Conservative Exploitation with Linear Reward Shaping

Paper Authors

Hao Sun, Lei Han, Rui Yang, Xiaoteng Ma, Jian Guo, Bolei Zhou

Paper Abstract

In this work, we study a simple yet universally applicable case of reward shaping in value-based Deep Reinforcement Learning (DRL). We show that reward shifting in the form of a linear transformation is equivalent to changing the initialization of the $Q$-function in function approximation. Based on this equivalence, we bring the key insight that positive reward shifting leads to conservative exploitation, while negative reward shifting leads to curiosity-driven exploration. Accordingly, conservative exploitation improves value estimation in offline RL, and optimistic value estimation improves exploration in online RL. We validate our insight on a range of RL tasks and show its improvement over baselines: (1) in offline RL, conservative exploitation improves the performance of off-the-shelf algorithms; (2) in online continuous control, multiple value functions with different shifting constants can be used to tackle the exploration-exploitation dilemma for better sample efficiency; (3) in discrete control tasks, negative reward shifting yields an improvement over curiosity-based exploration methods.
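As a sketch of the claimed equivalence (our notation, not taken from the paper): with discount factor $\gamma \in [0, 1)$, shifting every reward by a constant $c$, i.e. $r'_t = r_t + c$, shifts every return, and hence the Bellman fixed point, by a constant:

$$Q'^*(s, a) = \max_\pi \, \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t \,(r_t + c) \,\Big|\, s_0 = s,\, a_0 = a\Big] = Q^*(s, a) + \frac{c}{1 - \gamma}.$$

For a function approximator whose initial output is near zero, training on shifted rewards is therefore equivalent to initializing the $Q$-function at $-c/(1-\gamma)$ on the original reward scale: $c > 0$ yields a pessimistic (conservative) initialization, and $c < 0$ an optimistic one.

Below is a minimal code sketch of where the shift enters a standard one-step TD target (our illustration under the assumptions above, not the authors' implementation; the function name and signature are hypothetical):

```python
def shifted_td_target(reward, next_q_max, done, gamma=0.99, shift=-1.0):
    """One-step TD target computed on shifted rewards r + shift.

    The only change to a standard value-based target is the additive
    constant `shift`. Per the paper's insight: shift < 0 makes a
    zero-initialized Q-network optimistic relative to the shifted
    fixed point (curiosity-driven exploration), while shift > 0 makes
    it pessimistic (conservative exploitation, e.g. for offline RL).
    """
    return (reward + shift) + gamma * (1.0 - done) * next_q_max
```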
