Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process

Paper Title

Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process

Authors

Chengchun Shi, Jin Zhu, Ye Shen, Shikai Luo, Hongtu Zhu, Rui Song

Abstract

This paper is concerned with constructing a confidence interval for a target policy's value offline, based on pre-collected observational data in infinite-horizon settings. Most existing works assume that no unmeasured variables confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provides rigorous uncertainty quantification. Our method is justified by theoretical results and by simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope.
