Paper Title
Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning
Paper Authors
Paper Abstract
In this paper, we consider multi-objective reinforcement learning, where the objectives are balanced using preferences. In practice, preferences are often given in an adversarial manner; for example, customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where the transitions are unknown and the reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. We consider two settings. In the online setting, the agent receives an (adversarial) preference every episode and proposes policies to interact with the environment. We provide a model-based algorithm that achieves a nearly minimax optimal regret bound $\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\}\cdot H^2 SAK}\bigr)$, where $d$ is the number of objectives, $S$ is the number of states, $A$ is the number of actions, $H$ is the length of the horizon, and $K$ is the number of episodes. Furthermore, we consider preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and can then accommodate arbitrary preference vectors up to an $\epsilon$ error. Our proposed algorithm is provably efficient, with a nearly optimal trajectory complexity of $\widetilde{\mathcal{O}}\bigl({\min\{d,S\}\cdot H^3 SA}/{\epsilon^2}\bigr)$. This result partly resolves an open problem raised by \citet{jin2020reward}.
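The abstract's reward model (a scalar reward obtained as the inner product of a preference vector with a $d$-dimensional multi-objective reward) can be illustrated with a minimal sketch. All names, sizes, and the random reward table below are illustrative assumptions, not part of the paper's algorithm:

```python
import numpy as np

# Hypothetical sizes: d objectives, S states, A actions.
d, S, A = 3, 5, 2

rng = np.random.default_rng(0)
# Pre-specified multi-objective reward table R(s, a) in [0, 1]^d (assumed data).
multi_reward = rng.uniform(size=(S, A, d))

def scalarized_reward(w, s, a):
    """Scalar reward <w, R(s, a)> for a (possibly adversarial) preference w."""
    return float(w @ multi_reward[s, a])

# A "picky customer" supplies a new preference each episode; keeping w on the
# simplex (w >= 0, sum(w) = 1) keeps the scalar reward in [0, 1].
w = np.array([0.7, 0.2, 0.1])
r = scalarized_reward(w, s=0, a=1)
```

In the online setting this scalarization changes every episode with the adversarial preference, while in preference-free exploration the agent must gather data before any `w` is revealed.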