Paper Title
Online learning with dynamics: A minimax perspective
Paper Authors
Paper Abstract
We study the problem of online learning with dynamics, where a learner interacts with a stateful environment over multiple rounds. In each round of the interaction, the learner selects a policy to deploy and incurs a cost that depends on both the chosen policy and the current state of the world. The state-evolution dynamics and the costs are allowed to be time-varying, in a possibly adversarial way. In this setting, we study the problem of minimizing policy regret and provide non-constructive upper bounds on the minimax rate for the problem. Our main results provide sufficient conditions for online learnability in this setup, with corresponding rates. The rates are characterized by (1) a complexity term capturing the expressiveness of the underlying policy class under the dynamics of state change, and (2) a dynamics stability term measuring the deviation of the instantaneous loss from a certain counterfactual loss. Further, we provide matching lower bounds which show that both of these complexity terms are indeed necessary. Our approach provides a unifying analysis that recovers regret bounds for several well-studied problems, including online learning with memory, online control of linear quadratic regulators, online Markov decision processes, and tracking adversarial targets. In addition, we show how our tools help obtain tight regret bounds for a new problem (with non-linear dynamics and non-convex losses) for which such bounds were not known prior to our work.
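For readers who want the objective spelled out, the following is a minimal sketch of the interaction protocol and of policy regret as described in the abstract. The symbols used here, x_t for the state, f_t for the dynamics, \ell_t for the cost, and \Pi for the policy class, are illustrative notation chosen for this sketch and may differ from the paper's own.

% A sketch of the protocol and the policy-regret objective.
% Notation (x_t, f_t, \ell_t, \Pi) is illustrative, not the paper's.
\begin{align*}
  &\text{For rounds } t = 1, \dots, T:\\
  &\qquad \text{the learner selects a policy } \pi_t \in \Pi
    \text{ and incurs cost } \ell_t(x_t, \pi_t),\\
  &\qquad \text{after which the state evolves as } x_{t+1} = f_t(x_t, \pi_t).\\[4pt]
  &\text{Policy regret compares against the counterfactual state sequence}\\
  &\text{that a single fixed policy would have generated if deployed throughout:}\\
  &\qquad \mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \ell_t(x_t, \pi_t)
    \;-\; \inf_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t\bigl(x_t^{\pi}, \pi\bigr),
    \qquad \text{where } x_{t+1}^{\pi} = f_t\bigl(x_t^{\pi}, \pi\bigr).
\end{align*}

The counterfactual states x_t^{\pi} are what makes this a policy-regret (rather than standard-regret) benchmark: the comparator policy is charged on the trajectory it would itself induce, which is precisely why the dynamics stability term mentioned above enters the rates.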