Upper Counterfactual Confidence Bounds: A New Optimism Principle for Contextual Bandits

Paper Title

Upper Counterfactual Confidence Bounds: a New Optimism Principle for Contextual Bandits

Authors

Xu, Yunbei; Zeevi, Assaf

Abstract

The principle of optimism in the face of uncertainty is one of the most widely used and successful ideas in multi-armed bandits and reinforcement learning. However, existing optimistic algorithms (primarily UCB and its variants) often struggle to deal with general function classes and large context spaces. In this paper, we study general contextual bandits with an offline regression oracle and propose a simple, generic principle to design optimistic algorithms, dubbed "Upper Counterfactual Confidence Bounds" (UCCB). The key innovation of UCCB is building confidence bounds in policy space, rather than in action space as is done in UCB. We demonstrate that these algorithms are provably optimal and computationally efficient in handling general function classes and large context spaces. Furthermore, we illustrate that the UCCB principle can be seamlessly extended to infinite-action general contextual bandits, providing the first solutions to these settings when employing an offline regression oracle.
