Paper Title

Double Thompson Sampling in Finite Stochastic Games

Paper Authors

Shuqing Shi, Xiaobin Wang, Zhiyou Yang, Fan Zhang, Hong Qu

Paper Abstract

We consider the trade-off between exploration and exploitation in finite discounted Markov Decision Processes, where the state transition matrix of the underlying environment is unknown. We propose a double Thompson sampling reinforcement learning algorithm (DTS) to solve this kind of problem. The algorithm achieves a total regret bound of $\tilde{\mathcal{O}}(D\sqrt{SAT})$ over time horizon $T$ with $S$ states, $A$ actions, and diameter $D$. DTS consists of two parts: in the first, traditional part, we apply posterior sampling to the transition matrix based on a prior distribution; in the second, we employ a count-based posterior update method to balance the locally optimal action against the long-term optimal action in order to find the globally optimal game value. We establish a regret bound of $\tilde{\mathcal{O}}(\sqrt{T}/S^{2})$, which is, to our knowledge, the best regret bound for finite discounted Markov Decision Processes to date. Numerical results demonstrate the efficiency and superiority of our approach.
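The first, "traditional" component described in the abstract, posterior sampling of the unknown transition matrix from a prior, can be illustrated with a standard Dirichlet-posterior sketch. The code below is not the authors' DTS algorithm; the state/action sizes, reward table, environment step, and all function names are hypothetical placeholders, and it only shows how a sampled model is planned over and how a count-based posterior update works.

```python
# Illustrative sketch only: posterior (Thompson) sampling of an unknown
# transition matrix in a finite discounted MDP with a Dirichlet prior per
# (state, action) pair. This is NOT the paper's DTS algorithm.
import numpy as np

S, A = 5, 3          # assumed number of states and actions
gamma = 0.95         # discount factor
rng = np.random.default_rng(0)

# Dirichlet concentration parameters: one pseudo-count per (s, a, s') triple.
alpha = np.ones((S, A, S))
rewards = rng.uniform(size=(S, A))   # rewards assumed known, for simplicity

def sample_transition_matrix(alpha):
    """Draw one plausible model P(s' | s, a) from the current posterior."""
    P = np.empty_like(alpha)
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(alpha[s, a])
    return P

def greedy_policy(P, rewards, gamma, iters=200):
    """Value iteration on the sampled model; return a greedy policy."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = rewards + gamma * P @ V          # Q has shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Sampling loop: sample a model, act greedily under it, update the counts.
state = 0
for t in range(1000):
    P_hat = sample_transition_matrix(alpha)
    policy = greedy_policy(P_hat, rewards, gamma)
    action = policy[state]
    next_state = rng.choice(S, p=np.ones(S) / S)   # stand-in for the real environment
    alpha[state, action, next_state] += 1           # count-based posterior update
    state = next_state
```

Because the Dirichlet distribution is conjugate to the categorical transition model, the posterior update reduces to incrementing a visit count, which is what makes this style of count-based update cheap to run at every step.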
