Paper Title

Automatic Reward Design via Learning Motivation-Consistent Intrinsic Rewards

Authors

Yixiang Wang, Yujing Hu, Feng Wu, Yingfeng Chen

Abstract

Reward design is a critical part of applying reinforcement learning: performance strongly depends on how well the reward signal frames the designer's goal and how well it assesses progress toward that goal. In many cases, the extrinsic rewards provided by the environment (e.g., the win or loss of a game) are very sparse, which makes it difficult to train agents directly. In practice, researchers usually assist learning by adding auxiliary rewards, but designing them often turns into a trial-and-error search for reward settings that produce acceptable results. In this paper, we propose to automatically generate goal-consistent intrinsic rewards for the agent, such that maximizing them also maximizes the expected cumulative extrinsic rewards. To this end, we introduce the concept of motivation, which captures the underlying goal of maximizing certain rewards, and propose a motivation-based reward design method. The basic idea is to shape the intrinsic rewards by minimizing the distance between the intrinsic and extrinsic motivations. We conduct extensive experiments and show that our method outperforms state-of-the-art methods in handling delayed reward, exploration, and credit assignment problems.
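
The abstract's core idea, shaping intrinsic rewards by minimizing the distance between intrinsic and extrinsic motivations, can be illustrated with a minimal sketch. The sketch below is an assumption-laden toy, not the paper's algorithm: it represents a reward function's "motivation" as the softmax action preferences induced by its optimal Q-values in a small tabular MDP, and shapes a dense intrinsic reward table by greedy local search so that its motivation matches that of a sparse extrinsic reward. All names here (motivation, dist, the coordinate search) are hypothetical choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 5, 3, 0.9

    # Random deterministic transitions; sparse extrinsic reward only at the last state.
    P = rng.integers(n_states, size=(n_states, n_actions))   # P[s, a] = next state
    r_ext = np.zeros((n_states, n_actions))
    r_ext[n_states - 1, :] = 1.0

    def q_values(r, iters=200):
        """Q-value iteration under reward table r (Bellman optimality backup)."""
        Q = np.zeros_like(r)
        for _ in range(iters):
            Q = r + gamma * Q[P].max(axis=2)   # Q[P][s, a] = Q-row of next state
        return Q

    def motivation(Q):
        """Toy stand-in for 'motivation': softmax action preferences per state."""
        z = Q - Q.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    m_ext = motivation(q_values(r_ext))        # extrinsic motivation (fixed target)

    def dist(r):
        """Squared distance between the motivation induced by r and m_ext."""
        return float(np.sum((motivation(q_values(r)) - m_ext) ** 2))

    # Shape a dense intrinsic reward by greedy coordinate search on that distance.
    r_int = rng.normal(scale=0.1, size=(n_states, n_actions))
    step = 0.05
    for _ in range(300):
        s, a = rng.integers(n_states), rng.integers(n_actions)
        best = dist(r_int)
        for delta in (step, -step):
            trial = r_int.copy()
            trial[s, a] += delta
            if dist(trial) < best:
                r_int = trial
                break

    print("motivation distance after shaping:", round(dist(r_int), 4))

Once dist is near zero, a policy that is greedy with respect to r_int prefers the same actions as one trained on r_ext, which is the sense in which the intrinsic reward is motivation-consistent. The paper presumably optimizes this kind of objective with function approximation and gradient-based updates; the tabular search above only conveys the distance-minimization idea.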
