Paper Title

Adaptive Reward-Free Exploration

Paper Authors

Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, Michal Valko

Paper Abstract

Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead give a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs of order $(SAH^4/\varepsilon^2)(\log(1/\delta) + S)$ episodes to output, with probability $1-\delta$, an $\varepsilon$-approximation of the optimal policy for any reward function. This bound improves over existing sample-complexity bounds in both the small $\varepsilon$ and the small $\delta$ regimes. We further investigate the relative complexities of reward-free exploration and best-policy identification.
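
The abstract describes RF-UCRL only at a high level. The sketch below illustrates the general idea of a reward-free UCRL-style loop on a toy tabular MDP: maintain upper bounds on the per-state-action estimation error, act greedily with respect to those bounds, and stop once the bound at the initial state is small. The Hoeffding-style bonus, the $\varepsilon/2$ stopping threshold, the problem sizes, and the random environment are all illustrative assumptions, not the paper's exact algorithm or constants.

```python
# Hedged sketch of a reward-free UCRL-style exploration loop (not the authors'
# exact RF-UCRL pseudocode). The agent keeps upper bounds E_h(s, a) on the MDP
# estimation error, plays greedily with respect to them, and stops once the
# bound at the initial state drops below epsilon/2 (threshold assumed).
import numpy as np

S, A, H = 5, 3, 6            # states, actions, horizon (toy sizes, assumed)
epsilon, delta = 0.1, 0.05
rng = np.random.default_rng(0)

# Hypothetical stand-in environment: a random tabular MDP we can sample from.
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # true kernel, unknown to the agent

def step(s, a):
    """Sample the next state from the (hidden) true transition kernel."""
    return rng.choice(S, p=P_true[s, a])

# Empirical model maintained by the agent.
counts = np.zeros((S, A))
trans_counts = np.zeros((S, A, S))

def error_bounds():
    """Backward induction: E_h(s,a) = min(H, bonus + E_{p_hat}[max_a' E_{h+1}(s',a')])."""
    p_hat = trans_counts / np.maximum(counts[..., None], 1)
    E = np.zeros((H + 2, S, A))
    # Hoeffding-style exploration bonus (assumed form and constants).
    bonus = H * np.sqrt(2 * np.log(2 * S * A * H / delta) / np.maximum(counts, 1))
    bonus[counts == 0] = H                       # unvisited pairs get the trivial bound
    for h in range(H, 0, -1):
        next_err = E[h + 1].max(axis=1)          # max_{a'} E_{h+1}(s', a')
        E[h] = np.minimum(H, bonus + p_hat @ next_err)
    return E

for episode in range(10_000):
    E = error_bounds()
    if E[1, 0].max() <= epsilon / 2:             # stopping rule at initial state s = 0
        break
    s = 0
    for h in range(1, H + 1):                    # act greedily w.r.t. the error bounds
        a = int(E[h, s].argmax())
        s_next = step(s, a)
        counts[s, a] += 1
        trans_counts[s, a, s_next] += 1
        s = s_next

print(f"stopped after {episode} episodes; error bound at s0: {E[1, 0].max():.3f}")
```

After stopping, the empirical model is (under this sketch's assumptions) accurate enough that planning in it with any reward function yields an $\varepsilon$-approximation of the corresponding optimal policy, which is the reward-free guarantee the abstract refers to.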
