Robust Batch Policy Learning in Markov Decision Processes

Paper title

Robust Batch Policy Learning in Markov Decision Processes

Authors

Zhengling Qi, Peng Liao

Abstract

We study the offline data-driven sequential decision-making problem in the framework of Markov decision processes (MDPs). In order to enhance the generalizability and adaptivity of the learned policy, we propose to evaluate each policy by a set of average rewards with respect to distributions centered at the policy-induced stationary distribution. Given a pre-collected dataset of multiple trajectories generated by some behavior policy, our goal is to learn a robust policy in a pre-specified policy class that can maximize the smallest value of this set. Leveraging the theory of semi-parametric statistics, we develop a statistically efficient policy learning method for estimating the defined robust optimal policy. A rate-optimal regret bound up to a logarithmic factor is established in terms of total decision points in the dataset.
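
The max-min criterion described in the abstract can be sketched as follows. This is only an illustrative reading, not the paper's own notation: $\Pi$ stands for the pre-specified policy class, $d^{\pi}$ for the stationary distribution induced by policy $\pi$, $R$ for the reward, and $\mathcal{U}(d^{\pi})$ is a hypothetical symbol for the set of distributions centered at $d^{\pi}$, whose exact construction the abstract does not state.

```latex
% Illustrative sketch; \mathcal{U}(d^{\pi}) and R are assumed notation.
\[
  \pi^{*} \in \arg\max_{\pi \in \Pi} \;
  \min_{d \in \mathcal{U}(d^{\pi})} \;
  \mathbb{E}_{(S, A) \sim d}\bigl[ R(S, A) \bigr]
\]
```

Under this reading, the inner minimum is the smallest average reward over the set, and the outer maximum selects the policy in the class whose worst case is best.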
