Paper title
Policy evaluation from a single path: Multi-step methods, mixing and mis-specification
Paper authors
Paper abstract
We study non-parametric estimation of the value function of an infinite-horizon $\gamma$-discounted Markov reward process (MRP) using observations from a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, including canonical $K$-step look-ahead TD for $K = 1, 2, \ldots$ and the TD$(\lambda)$ family for $\lambda \in [0,1)$ as special cases. Our bounds capture the dependence of the estimation error on Bellman fluctuations, the mixing time of the Markov chain, any mis-specification in the model, and the choice of weight function defining the estimator itself, and they reveal some delicate interactions between mixing time and model mis-specification. For a given TD method applied to a well-specified model, the statistical error under trajectory data is similar to that under i.i.d. sample transition pairs, whereas under mis-specification, temporal dependence in the data inflates the statistical error. However, any such deterioration can be mitigated by increased look-ahead. We complement our upper bounds by proving minimax lower bounds that establish the optimality of TD-based methods with appropriately chosen look-ahead and weighting, and reveal some fundamental differences between value function estimation and ordinary non-parametric regression.
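For context (a standard-definitions sketch, not taken from the abstract itself, with the symbols $\mathcal{T}$, $r$, $s_k$ assumed here): the $K$-step look-ahead construction replaces the one-step Bellman backup with a $K$-step one, and TD$(\lambda)$ geometrically averages these multi-step backups,
$$
(\mathcal{T}^{(K)} V)(s) \;=\; \mathbb{E}\!\left[\,\sum_{k=0}^{K-1} \gamma^{k}\, r(s_k) \;+\; \gamma^{K} V(s_K) \;\middle|\; s_0 = s\right],
\qquad
\mathcal{T}^{(\lambda)} \;=\; (1-\lambda) \sum_{K=1}^{\infty} \lambda^{K-1}\, \mathcal{T}^{(K)},
$$
with the TD estimate obtained as a (projected) fixed point of the chosen multi-step operator; larger look-ahead $K$, or $\lambda$ closer to $1$, weights longer rollouts more heavily, which is the lever the abstract refers to for mitigating mis-specification error.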