Smoothing Advantage Learning

Paper Title

Smoothing Advantage Learning

Authors

Yaozhong Gan, Zhe Zhang, Xiaoyang Tan

Abstract

Advantage learning (AL) aims to improve the robustness of value-based reinforcement learning against estimation errors by means of action-gap-based regularization. Unfortunately, the method tends to be unstable under function approximation. In this paper, we propose a simple variant of AL, named smoothing advantage learning (SAL), to alleviate this problem. The key to our method is to replace the original Bellman optimality operator in AL with a smooth one, so as to obtain a more reliable estimate of the temporal difference target. We give a detailed account of the resulting action gap and of the performance bound for approximate SAL. Further theoretical analysis reveals that the proposed value smoothing technique not only helps to stabilize the training procedure of AL by controlling the trade-off between the convergence rate and the upper bound of the approximation errors, but also helps to increase the action gap between the optimal and sub-optimal action values.
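To make the operator replacement described above concrete, the following is a minimal tabular sketch in Python with NumPy. The AL operator is written in its commonly cited form, TQ(s,a) - alpha * (max_b Q(s,b) - Q(s,a)). The abstract does not spell out the smooth operator, so the interpolation (1 - omega) * Q + omega * TQ used here is an assumption for illustration only, as are all function names, the parameter names alpha and omega, and the randomly generated MDP.

```python
import numpy as np

def bellman_optimality(Q, P, R, gamma):
    """(TQ)(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') * max_a' Q(s',a')."""
    V = Q.max(axis=1)             # greedy state values, shape (S,)
    return R + gamma * (P @ V)    # (S,A,S) @ (S,) -> (S,A)

def advantage_learning(Q, P, R, gamma, alpha):
    """AL operator: Bellman backup minus alpha times the action gap."""
    V = Q.max(axis=1, keepdims=True)
    return bellman_optimality(Q, P, R, gamma) - alpha * (V - Q)

def smoothed_advantage_learning(Q, P, R, gamma, alpha, omega):
    """ASSUMED smooth variant: TQ is replaced by (1 - omega) * Q + omega * TQ.

    This interpolation is a guess at what "a smooth one" could mean;
    it is not taken from the abstract.
    """
    V = Q.max(axis=1, keepdims=True)
    TQ = bellman_optimality(Q, P, R, gamma)
    T_smooth = (1.0 - omega) * Q + omega * TQ
    return T_smooth - alpha * (V - Q)

# Hypothetical random MDP with 4 states and 3 actions.
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # normalize each row into a distribution
R = rng.random((S, A))
Q = rng.random((S, A))

Q_al = advantage_learning(Q, P, R, gamma=0.9, alpha=0.5)
Q_sal = smoothed_advantage_learning(Q, P, R, gamma=0.9, alpha=0.5, omega=0.7)
```

Under this assumed form, setting omega = 1 recovers plain AL and additionally setting alpha = 0 recovers the ordinary Bellman optimality backup, so omega acts as the knob that damps each update of the temporal difference target.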
