Paper Title
Supervised and Reinforcement Learning from Observations in Reconnaissance Blind Chess
Paper Authors
Paper Abstract
In this work, we adapt a training approach inspired by the original AlphaGo system to play the imperfect information game of Reconnaissance Blind Chess. Using only the observations instead of a full description of the game state, we first train a supervised agent on publicly available game records. Next, we increase the performance of the agent through self-play with the on-policy reinforcement learning algorithm Proximal Policy Optimization. To avoid problems caused by the partial observability of the game state, we do not use any search and rely only on the policy network to generate moves during play. With this approach, we achieve an Elo rating of 1330 on the RBC leaderboard, which places our agent at position 27 at the time of this writing. We see that self-play significantly improves performance and that the agent plays acceptably well without search and without making assumptions about the true game state.
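The abstract describes a two-stage pipeline: supervised pretraining on recorded moves from observations, followed by PPO fine-tuning through self-play, with the policy network alone selecting moves at play time. The sketch below illustrates that pipeline under stated assumptions; it is not the authors' code. The observation encoding (`OBS_DIM`), move vocabulary size (`NUM_MOVES`), network shape, and hyperparameters are illustrative placeholders.

```python
# Minimal sketch (assumed, not from the paper) of the two training stages:
# (1) supervised imitation of moves from public game records,
# (2) PPO fine-tuning on self-play data, using only observations as input.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM = 8 * 8 * 13   # hypothetical flattened board-observation encoding
NUM_MOVES = 4096       # hypothetical from-square x to-square move vocabulary

class PolicyValueNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(OBS_DIM, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.policy_head = nn.Linear(512, NUM_MOVES)  # move logits
        self.value_head = nn.Linear(512, 1)           # state-value estimate for PPO

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

net = PolicyValueNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

def supervised_step(obs, target_moves):
    """Stage 1: cross-entropy imitation of moves taken in recorded games."""
    logits, _ = net(obs)
    loss = F.cross_entropy(logits, target_moves)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def ppo_step(obs, actions, old_log_probs, advantages, returns, clip_eps=0.2):
    """Stage 2: one clipped-objective PPO update on a batch of self-play data."""
    logits, values = net(obs)
    dist = torch.distributions.Categorical(logits=logits)
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    loss = policy_loss + 0.5 * value_loss - 0.01 * dist.entropy().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

At play time, consistent with the abstract, move selection would amount to a single forward pass of the policy head over the current observation (e.g. taking the argmax or sampling over legal-move logits), with no search over possible true game states.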