Paper Title
A Unified Perspective on Value Backup and Exploration in Monte-Carlo Tree Search
Paper Authors
Paper Abstract
Monte-Carlo Tree Search (MCTS) is a class of methods for solving complex decision-making problems through the synergy of Monte-Carlo planning and Reinforcement Learning (RL). The highly combinatorial nature of the problems commonly addressed by MCTS requires efficient exploration strategies for navigating the planning tree and quickly convergent value backup methods. These crucial problems are particularly evident in recent advances that combine MCTS with deep neural networks for function approximation. In this work, we propose two methods for improving the convergence rate and exploration, based on a newly introduced backup operator and on entropy regularization. We provide strong theoretical guarantees bounding the convergence rate, approximation error, and regret of our methods. Moreover, we introduce a mathematical framework, based on the $α$-divergence, for backup and exploration in MCTS. We show that this theoretical formulation unifies different approaches, including our newly introduced ones, under the same mathematical framework, allowing different methods to be obtained simply by changing the value of $α$. In practice, our unified perspective offers a flexible way to balance exploration and exploitation by tuning the single $α$ parameter according to the problem at hand. We validate our methods through a rigorous empirical study ranging from basic toy problems to complex Atari games, covering both MDP and POMDP settings.
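To give a concrete flavor of the kind of parameterized backup operator the abstract alludes to, the sketch below shows a generalized (power) mean of children's Q-values weighted by visit counts. This is an illustrative assumption, not the paper's exact $α$-divergence operator: a single exponent `p` interpolates between the classic average backup (`p = 1`) and a max-like backup (large `p`), mirroring how one parameter can trade off exploration and exploitation.

```python
def power_mean_backup(q_values, visit_counts, p):
    """Illustrative power-mean backup for an MCTS node.

    Returns the visit-count-weighted p-power mean of the children's
    Q-values (assumed nonnegative here, so fractional powers are safe):
      p = 1        -> ordinary weighted average (classic MCTS backup)
      p -> infinity -> approaches max(q_values) (greedy max backup)

    This is a hypothetical sketch of the idea, not the paper's precise
    alpha-divergence formulation.
    """
    total = sum(visit_counts)
    weighted = sum((n / total) * q ** p
                   for q, n in zip(q_values, visit_counts))
    return weighted ** (1.0 / p)


q = [1.0, 2.0, 4.0]   # children's value estimates
n = [10, 5, 1]        # children's visit counts

print(power_mean_backup(q, n, 1.0))  # -> 1.5, the plain weighted average
print(power_mean_backup(q, n, 8.0))  # larger p pulls the backup toward max(q) = 4.0
```

Raising `p` makes the backup more optimistic about the best-looking child even when it has few visits, which is one way a single scalar parameter can shift the operator along the average-to-max spectrum described in the abstract.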