Paper Title

Model Selection in Reinforcement Learning with General Function Approximations

Paper Authors

Avishek Ghosh, Sayak Ray Chowdhury

Paper Abstract

We consider model selection for classic Reinforcement Learning (RL) environments -- Multi-Armed Bandits (MABs) and Markov Decision Processes (MDPs) -- under general function approximations. In the model selection framework, we do not know the function classes, denoted by $\mathcal{F}$ and $\mathcal{M}$, in which the true models -- the reward generating function for MABs and the transition kernel for MDPs -- lie, respectively. Instead, we are given $M$ nested function (hypothesis) classes such that the true models are contained in at least one such class. In this paper, we propose and analyze efficient model selection algorithms for MABs and MDPs that \emph{adapt} to the smallest function class (among the nested $M$ classes) containing the true underlying model. Under a separability assumption on the nested hypothesis classes, we show that the cumulative regret of our adaptive algorithms matches that of an oracle which knows the correct function classes (i.e., $\mathcal{F}$ and $\mathcal{M}$) a priori. Furthermore, for both settings, we show that the cost of model selection is an additive term in the regret having weak (logarithmic) dependence on the learning horizon $T$.
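To make the nested-class setup concrete, here is a minimal sketch (not the paper's algorithm) under assumed linear reward models: the classes $\mathcal{F}_1 \subset \dots \subset \mathcal{F}_M$ are represented by features restricted to the first $d_1 < \dots < d_M$ coordinates, and a hypothetical residual-threshold test `tol` stands in for the separability-based check, selecting the smallest class that explains the observed rewards well. The names `collect_data`, `select_class`, and `dims` are illustrative only.

```python
# Hypothetical illustration of adapting to the smallest nested class
# containing the true model; not the algorithm proposed in the paper.
import numpy as np

rng = np.random.default_rng(0)
dims = [2, 4, 8]                  # nested class "complexities" d_1 < d_2 < d_3
d_true = 4                        # the true model lives in the second class
theta_true = np.zeros(dims[-1])
theta_true[:d_true] = rng.normal(size=d_true)

def collect_data(n=500, noise=0.1):
    """Simulate contexts and noisy rewards from the true linear model."""
    X = rng.normal(size=(n, dims[-1]))
    y = X @ theta_true + noise * rng.normal(size=n)
    return X, y

def select_class(X, y, dims, tol=0.05):
    """Return the index of the smallest class whose least-squares fit is good enough."""
    for i, d in enumerate(dims):
        theta_hat, *_ = np.linalg.lstsq(X[:, :d], y, rcond=None)
        mse = np.mean((X[:, :d] @ theta_hat - y) ** 2)
        if mse <= tol:            # threshold plays the role of the separability gap
            return i
    return len(dims) - 1

X, y = collect_data()
print("selected class index:", select_class(X, y, dims))  # expected: 1
```

In this toy version, the fit-quality test escalates to a richer class only when the current one is inadequate, which is the sense in which an adaptive procedure can match an oracle that knows the correct class while paying only a small additional cost for the selection step.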
