Paper title
Policy evaluation from a single path: Multi-step methods, mixing and mis-specification
Paper authors
Paper abstract
We study non-parametric estimation of the value function of an infinite-horizon $\gamma$-discounted Markov reward process (MRP) using observations from a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, including canonical $K$-step look-ahead TD for $K = 1, 2, \ldots$ and the TD$(\lambda)$ family for $\lambda \in [0,1)$ as special cases. Our bounds capture the dependence of the estimation error on Bellman fluctuations, the mixing time of the Markov chain, any mis-specification in the model, and the choice of weight function defining the estimator itself, and they reveal some delicate interactions between mixing time and model mis-specification. For a given TD method applied to a well-specified model, the statistical error under trajectory data is similar to that under i.i.d. sample transition pairs, whereas under mis-specification, temporal dependence in the data inflates the statistical error. However, any such deterioration can be mitigated by increased look-ahead. We complement our upper bounds by proving minimax lower bounds that establish the optimality of TD-based methods with appropriately chosen look-ahead and weighting, and reveal some fundamental differences between value function estimation and ordinary non-parametric regression.
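For context (a standard-definitions sketch, not taken from the abstract itself, with the symbols $\mathcal{T}$, $r$, $s_k$ assumed here): the $K$-step look-ahead construction replaces the one-step Bellman backup with a $K$-step one, and TD$(\lambda)$ geometrically averages these multi-step backups,
$$
(\mathcal{T}^{(K)} V)(s) \;=\; \mathbb{E}\!\left[\,\sum_{k=0}^{K-1} \gamma^{k}\, r(s_k) \;+\; \gamma^{K} V(s_K) \;\middle|\; s_0 = s\right],
\qquad
\mathcal{T}^{(\lambda)} \;=\; (1-\lambda) \sum_{K=1}^{\infty} \lambda^{K-1}\, \mathcal{T}^{(K)},
$$
with the TD estimate obtained as a (projected) fixed point of the chosen multi-step operator; larger look-ahead $K$, or $\lambda$ closer to $1$, weights longer rollouts more heavily, which is the lever the abstract refers to for mitigating mis-specification error.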