Paper Title

Optimistic Curiosity Exploration and Conservative Exploitation with Linear Reward Shaping

Paper Authors

Hao Sun, Lei Han, Rui Yang, Xiaoteng Ma, Jian Guo, Bolei Zhou

Paper Abstract

In this work, we study a simple yet universally applicable case of reward shaping in value-based Deep Reinforcement Learning (DRL). We show that reward shifting in the form of a linear transformation is equivalent to changing the initialization of the $Q$-function in function approximation. Based on this equivalence, we bring the key insight that positive reward shifting leads to conservative exploitation, while negative reward shifting leads to curiosity-driven exploration. Accordingly, conservative exploitation improves value estimation in offline RL, and optimistic value estimation improves exploration in online RL. We validate our insight on a range of RL tasks and show its improvement over baselines: (1) in offline RL, conservative exploitation improves the performance of off-the-shelf algorithms; (2) in online continuous control, multiple value functions with different shifting constants can be used to tackle the exploration-exploitation dilemma for better sample efficiency; (3) in discrete control tasks, negative reward shifting yields an improvement over curiosity-based exploration methods.
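As a sketch of the claimed equivalence (our notation, not taken from the paper): with discount factor $\gamma \in [0, 1)$, shifting every reward by a constant $c$, i.e. $r'_t = r_t + c$, shifts every return, and hence the Bellman fixed point, by a constant:

$$Q'^*(s, a) = \max_\pi \, \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t \,(r_t + c) \,\Big|\, s_0 = s,\, a_0 = a\Big] = Q^*(s, a) + \frac{c}{1 - \gamma}.$$

For a function approximator whose initial output is near zero, training on shifted rewards is therefore equivalent to initializing the $Q$-function at $-c/(1-\gamma)$ on the original reward scale: $c > 0$ yields a pessimistic (conservative) initialization, and $c < 0$ an optimistic one.

Below is a minimal code sketch of where the shift enters a standard one-step TD target (our illustration under the assumptions above, not the authors' implementation; the function name and signature are hypothetical):

```python
def shifted_td_target(reward, next_q_max, done, gamma=0.99, shift=-1.0):
    """One-step TD target computed on shifted rewards r + shift.

    The only change to a standard value-based target is the additive
    constant `shift`. Per the paper's insight: shift < 0 makes a
    zero-initialized Q-network optimistic relative to the shifted
    fixed point (curiosity-driven exploration), while shift > 0 makes
    it pessimistic (conservative exploitation, e.g. for offline RL).
    """
    return (reward + shift) + gamma * (1.0 - done) * next_q_max
```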
