Paper Title

Model Selection in Reinforcement Learning with General Function Approximations

Paper Authors

Avishek Ghosh, Sayak Ray Chowdhury

Paper Abstract

We consider model selection for classic Reinforcement Learning (RL) environments -- Multi-Armed Bandits (MABs) and Markov Decision Processes (MDPs) -- under general function approximations. In the model selection framework, we do not know the function classes, denoted by $\mathcal{F}$ and $\mathcal{M}$, in which the true models -- the reward generating function for MABs and the transition kernel for MDPs -- lie, respectively. Instead, we are given $M$ nested function (hypothesis) classes such that the true models are contained in at least one such class. In this paper, we propose and analyze efficient model selection algorithms for MABs and MDPs that \emph{adapt} to the smallest function class (among the nested $M$ classes) containing the true underlying model. Under a separability assumption on the nested hypothesis classes, we show that the cumulative regret of our adaptive algorithms matches that of an oracle which knows the correct function classes (i.e., $\mathcal{F}$ and $\mathcal{M}$) a priori. Furthermore, for both settings, we show that the cost of model selection is an additive term in the regret having weak (logarithmic) dependence on the learning horizon $T$.
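To make the nested-class setup concrete, here is a minimal sketch (not the paper's algorithm) under assumed linear reward models: the classes $\mathcal{F}_1 \subset \dots \subset \mathcal{F}_M$ are represented by features restricted to the first $d_1 < \dots < d_M$ coordinates, and a hypothetical residual-threshold test `tol` stands in for the separability-based check, selecting the smallest class that explains the observed rewards well. The names `collect_data`, `select_class`, and `dims` are illustrative only.

```python
# Hypothetical illustration of adapting to the smallest nested class
# containing the true model; not the algorithm proposed in the paper.
import numpy as np

rng = np.random.default_rng(0)
dims = [2, 4, 8]                  # nested class "complexities" d_1 < d_2 < d_3
d_true = 4                        # the true model lives in the second class
theta_true = np.zeros(dims[-1])
theta_true[:d_true] = rng.normal(size=d_true)

def collect_data(n=500, noise=0.1):
    """Simulate contexts and noisy rewards from the true linear model."""
    X = rng.normal(size=(n, dims[-1]))
    y = X @ theta_true + noise * rng.normal(size=n)
    return X, y

def select_class(X, y, dims, tol=0.05):
    """Return the index of the smallest class whose least-squares fit is good enough."""
    for i, d in enumerate(dims):
        theta_hat, *_ = np.linalg.lstsq(X[:, :d], y, rcond=None)
        mse = np.mean((X[:, :d] @ theta_hat - y) ** 2)
        if mse <= tol:            # threshold plays the role of the separability gap
            return i
    return len(dims) - 1

X, y = collect_data()
print("selected class index:", select_class(X, y, dims))  # expected: 1
```

In this toy version, the fit-quality test escalates to a richer class only when the current one is inadequate, which is the sense in which an adaptive procedure can match an oracle that knows the correct class while paying only a small additional cost for the selection step.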
