Paper Title

Strangeness-driven Exploration in Multi-Agent Reinforcement Learning

Paper Authors

Ju-Bong Kim, Ho-Bin Choi, Youn-Hee Han

Paper Abstract

An efficient exploration strategy is one of the essential issues in cooperative multi-agent reinforcement learning (MARL) algorithms that require complex coordination. In this study, we introduce a new exploration method based on strangeness that can be easily incorporated into any centralized training and decentralized execution (CTDE)-based MARL algorithm. Strangeness refers to the degree of unfamiliarity of the observations that an agent visits. To give the observation strangeness a global perspective, it is also augmented with the degree of unfamiliarity of the visited entire state. The exploration bonus is obtained from the strangeness, and the proposed exploration method is not much affected by the stochastic transitions commonly observed in MARL tasks. To prevent a high exploration bonus from making MARL training insensitive to extrinsic rewards, we also propose a separate action-value function trained on both the extrinsic reward and the exploration bonus, on which the behavioral policy that generates transitions is based. This makes CTDE-based MARL algorithms more stable when they are used with an exploration method. Through a comparative evaluation on didactic examples and the StarCraft Multi-Agent Challenge, we show that the proposed exploration method achieves significant performance improvements in CTDE-based MARL algorithms.
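The abstract does not give implementation details, but its two core ideas, a strangeness bonus combining local observation unfamiliarity with global state unfamiliarity, and a separate action-value target that mixes the bonus with the extrinsic reward, can be sketched as follows. This is a minimal illustration assuming an autoencoder whose reconstruction error serves as the unfamiliarity measure; the architecture, the mixing weight `alpha`, the bonus scale `beta`, and all names here are hypothetical, not the paper's actual design.

```python
# Hypothetical sketch of a strangeness-style exploration bonus.
# Not the paper's implementation: it only illustrates (a) per-agent
# observation unfamiliarity, (b) augmentation with global-state
# unfamiliarity, and (c) mixing the bonus into the reward used by a
# separate action-value function that drives the behavioral policy.
import torch
import torch.nn as nn


class Autoencoder(nn.Module):
    """Small MLP autoencoder; high reconstruction error marks unfamiliar input."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden // 2))
        self.dec = nn.Sequential(nn.Linear(hidden // 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dec(self.enc(x))


def strangeness_bonus(obs_ae: Autoencoder, state_ae: Autoencoder,
                      obs: torch.Tensor, state: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Exploration bonus for one timestep.

    obs:   [n_agents, obs_dim] per-agent observations
    state: [state_dim] the entire (global) state
    alpha: hypothetical weight mixing local and global unfamiliarity
    """
    with torch.no_grad():
        obs_err = ((obs_ae(obs) - obs) ** 2).mean()        # local unfamiliarity
        state_err = ((state_ae(state) - state) ** 2).mean()  # global unfamiliarity
    return (1.0 - alpha) * obs_err + alpha * state_err


def mixed_reward(r_ext: torch.Tensor, bonus: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
    """Target reward for the separate action-value function; the main
    action-value function is trained on r_ext alone."""
    return r_ext + beta * bonus
```

In such a setup, the behavioral policy that collects transitions would act greedily with respect to the action-value head trained on `mixed_reward`, while the main head, seeing only the extrinsic reward, remains insensitive to large exploration bonuses, matching the stability argument in the abstract.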
