Paper Title
Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward
Paper Authors
Paper Abstract
The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair. In many real-world applications, however, an agent can observe only a score that represents the quality of the whole trajectory, which is referred to as the {\em trajectory-wise reward}. In such a situation, standard RL methods struggle to make good use of the trajectory-wise reward, and large bias and variance errors can be incurred in policy evaluation. In this work, we propose a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards. To ensure that the value functions constructed by PARTED are always pessimistic with respect to the optimal ones, we design a new penalty term to offset the uncertainty of the proxy reward. For general episodic MDPs with large state spaces, we show that PARTED with overparameterized neural network function approximation achieves an $\tilde{\mathcal{O}}(D_{\text{eff}}H^2/\sqrt{N})$ suboptimality, where $H$ is the length of an episode, $N$ is the total number of samples, and $D_{\text{eff}}$ is the effective dimension of the neural tangent kernel matrix. To further illustrate the result, we show that PARTED achieves an $\tilde{\mathcal{O}}(dH^3/\sqrt{N})$ suboptimality for linear MDPs, where $d$ is the feature dimension; this matches the bound under neural network function approximation when $D_{\text{eff}}=dH$. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise rewards.
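As a rough illustration of the least-squares reward-redistribution step described in the abstract, the sketch below fits a linear per-step proxy reward from trajectory-wise returns only. It is not the authors' implementation: the feature map, the ridge parameter `lam`, and the function names (`redistribute_rewards`, `proxy_reward`) are illustrative assumptions, and the subsequent pessimistic value iteration with the uncertainty penalty is omitted.

```python
# Minimal sketch (assumed linear model, not the PARTED implementation):
# fit a per-step proxy reward r_hat(s, a) = <phi(s, a), theta> by ridge
# regression so that the sum of proxy rewards along each trajectory
# matches the observed trajectory-wise return R(tau).
import numpy as np

def redistribute_rewards(features, returns, lam=1.0):
    """features: (N, H, d) array of per-step features phi(s_h, a_h).
    returns:  (N,) array of trajectory-wise returns R(tau).
    Returns theta (d,) with sum_h <phi(s_h, a_h), theta> ~= R(tau)."""
    # Trajectory-level feature: sum of per-step features (by linearity).
    traj_feat = features.sum(axis=1)                      # (N, d)
    d = traj_feat.shape[1]
    # Ridge-regularized least squares: (Phi^T Phi + lam I) theta = Phi^T R.
    A = traj_feat.T @ traj_feat + lam * np.eye(d)
    b = traj_feat.T @ returns
    return np.linalg.solve(A, b)

def proxy_reward(theta, phi_sa):
    """Per-step proxy reward for a single (s, a) feature vector."""
    return phi_sa @ theta

# Usage on synthetic data with a known ground-truth theta.
rng = np.random.default_rng(0)
N, H, d = 500, 10, 8
features = rng.normal(size=(N, H, d))
true_theta = rng.normal(size=d)
returns = features.sum(axis=1) @ true_theta   # noiseless trajectory returns
theta_hat = redistribute_rewards(features, returns)
print(np.allclose(theta_hat, true_theta, atol=1e-2))  # recovers theta
print(proxy_reward(theta_hat, features[0, 0]))        # proxy reward at step 0
```

In PARTED the proxy reward is learned with neural network (or linear) function approximation and then fed into pessimistic value iteration, with a penalty term that additionally accounts for the uncertainty of this learned reward.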