Paper Title

Adaptive Reward-Free Exploration

Paper Authors

Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, Michal Valko

Paper Abstract

Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead give a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs of order $(SAH^4/\varepsilon^2)(\log(1/\delta) + S)$ episodes to output, with probability $1-\delta$, an $\varepsilon$-approximation of the optimal policy for any reward function. This bound improves over existing sample-complexity bounds in both the small $\varepsilon$ and the small $\delta$ regimes. We further investigate the relative complexities of reward-free exploration and best-policy identification.
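
The abstract describes RF-UCRL only at a high level. The sketch below illustrates the general idea of a reward-free UCRL-style loop on a toy tabular MDP: maintain upper bounds on the per-state-action estimation error, act greedily with respect to those bounds, and stop once the bound at the initial state is small. The Hoeffding-style bonus, the $\varepsilon/2$ stopping threshold, the problem sizes, and the random environment are all illustrative assumptions, not the paper's exact algorithm or constants.

```python
# Hedged sketch of a reward-free UCRL-style exploration loop (not the authors'
# exact RF-UCRL pseudocode). The agent keeps upper bounds E_h(s, a) on the MDP
# estimation error, plays greedily with respect to them, and stops once the
# bound at the initial state drops below epsilon/2 (threshold assumed).
import numpy as np

S, A, H = 5, 3, 6            # states, actions, horizon (toy sizes, assumed)
epsilon, delta = 0.1, 0.05
rng = np.random.default_rng(0)

# Hypothetical stand-in environment: a random tabular MDP we can sample from.
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # true kernel, unknown to the agent

def step(s, a):
    """Sample the next state from the (hidden) true transition kernel."""
    return rng.choice(S, p=P_true[s, a])

# Empirical model maintained by the agent.
counts = np.zeros((S, A))
trans_counts = np.zeros((S, A, S))

def error_bounds():
    """Backward induction: E_h(s,a) = min(H, bonus + E_{p_hat}[max_a' E_{h+1}(s',a')])."""
    p_hat = trans_counts / np.maximum(counts[..., None], 1)
    E = np.zeros((H + 2, S, A))
    # Hoeffding-style exploration bonus (assumed form and constants).
    bonus = H * np.sqrt(2 * np.log(2 * S * A * H / delta) / np.maximum(counts, 1))
    bonus[counts == 0] = H                       # unvisited pairs get the trivial bound
    for h in range(H, 0, -1):
        next_err = E[h + 1].max(axis=1)          # max_{a'} E_{h+1}(s', a')
        E[h] = np.minimum(H, bonus + p_hat @ next_err)
    return E

for episode in range(10_000):
    E = error_bounds()
    if E[1, 0].max() <= epsilon / 2:             # stopping rule at initial state s = 0
        break
    s = 0
    for h in range(1, H + 1):                    # act greedily w.r.t. the error bounds
        a = int(E[h, s].argmax())
        s_next = step(s, a)
        counts[s, a] += 1
        trans_counts[s, a, s_next] += 1
        s = s_next

print(f"stopped after {episode} episodes; error bound at s0: {E[1, 0].max():.3f}")
```

After stopping, the empirical model is (under this sketch's assumptions) accurate enough that planning in it with any reward function yields an $\varepsilon$-approximation of the corresponding optimal policy, which is the reward-free guarantee the abstract refers to.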
