Paper Title
Nearly Minimax Optimal Reward-free Reinforcement Learning
Paper Authors
Paper Abstract
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases. In the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal. In the planning phase, the agent needs to return a near-optimal policy for an arbitrary reward function. We give a new efficient algorithm, \textbf{S}taged \textbf{S}ampling + \textbf{T}runcated \textbf{P}lanning (\algoname), which interacts with the environment for at most $O\left( \frac{S^2A}{ε^2}\,\text{poly}\log\left(\frac{SAH}{ε}\right) \right)$ episodes in the exploration phase, and is guaranteed to output a near-optimal policy for an arbitrary reward function in the planning phase. Here, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $ε$ is the target accuracy relative to the total reward. Notably, our sample complexity scales only \emph{logarithmically} with $H$, in contrast to all existing results, which scale \emph{polynomially} with $H$. Furthermore, this bound matches the minimax lower bound $Ω\left(\frac{S^2A}{ε^2}\right)$ up to logarithmic factors. Our results rely on three new techniques: 1) a new sufficient condition on the dataset under which one can plan an $ε$-suboptimal policy; 2) a new way to plan efficiently under the proposed condition using soft-truncated planning; 3) constructing an extended MDP to efficiently maximize the truncated cumulative reward.
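To make the two-phase protocol in the abstract concrete, below is a minimal Python sketch of a generic reward-free RL pipeline on a tabular MDP. It is not the paper's SSTP algorithm: the exploration policy here is uniform-random (a simplified stand-in for staged sampling) and the planning step is plain finite-horizon value iteration on the empirical model (a stand-in for soft-truncated planning); all MDP sizes, the episode budget, and the reward function are made up for illustration.

# Generic two-phase reward-free RL sketch (illustration only, not SSTP).
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 3, 10                                 # states, actions, horizon (illustrative)
P_true = rng.dirichlet(np.ones(S), size=(S, A))    # true transition kernel P[s, a] -> dist over s'

def explore(num_episodes):
    """Exploration phase: collect (s, a, s') transitions without any reward signal."""
    counts = np.zeros((S, A, S))
    for _ in range(num_episodes):
        s = 0                                      # fixed initial state
        for _ in range(H):
            a = rng.integers(A)                    # uniform-random exploration (simplified)
            s_next = rng.choice(S, p=P_true[s, a])
            counts[s, a, s_next] += 1
            s = s_next
    return counts

def plan(counts, reward):
    """Planning phase: given an arbitrary reward function reward[h, s, a],
    run value iteration on the empirical transition model."""
    visits = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / S)
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward[h] + P_hat @ V                  # Q[s, a] = r[h, s, a] + sum_{s'} P_hat[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V[0]                            # greedy policy and its estimated value at the initial state

dataset = explore(num_episodes=2000)               # no rewards used here
r = rng.random((H, S, A))                          # an arbitrary reward function revealed only after exploration
pi, v0 = plan(dataset, r)
print("estimated optimal value from the initial state:", round(v0, 3))

The key property the paper targets is that a single exploration dataset like `dataset` above suffices to plan a near-optimal policy for any reward function supplied afterwards, with the number of exploration episodes scaling as stated in the abstract.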