Paper Title
Kalman meets Bellman: Improving Policy Evaluation through Value Tracking
Paper Authors
Paper Abstract
Policy evaluation is a key process in Reinforcement Learning (RL). It assesses a given policy by estimating the corresponding value function. When using parameterized value functions, common approaches minimize the sum of squared Bellman temporal-difference errors and obtain a point estimate for the parameters. Kalman-based and Gaussian-process-based frameworks have been suggested for evaluating the policy by treating the value as a random variable. These frameworks can learn uncertainties over the value parameters and exploit them for policy exploration. When adopting these frameworks to solve deep RL tasks, several limitations are revealed: excessive computation in each optimization step; difficulty in handling batches of samples, which slows training; and memory effects in stochastic environments, which prevent off-policy learning. In this work, we discuss these limitations and propose to overcome them with an alternative general framework based on the extended Kalman filter. We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA), that can be incorporated as a policy evaluation component in policy optimization algorithms. KOVA minimizes a regularized objective function that accounts for both parameter uncertainty and noisy-return uncertainty. We analyze the properties of KOVA and present its performance on deep RL control tasks.
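To make the idea concrete, the following is a minimal sketch (not the authors' KOVA implementation) of how an extended-Kalman-filter update can track value-function parameters as random variables, treating each observed return as a noisy measurement of the value. It assumes a simple linear value approximation v(s) = phi(s)^T theta, so the Jacobian reduces to the feature vector; the feature dimension and the noise terms R and Q are illustrative assumptions, not values from the paper.

# Minimal EKF-style value tracking sketch (illustrative only, not the paper's KOVA).
# Assumes a linear value function v(s) = phi(s)^T theta, so the measurement Jacobian
# is simply phi(s). phi_dim, R, and Q below are assumed values for the toy example.
import numpy as np

rng = np.random.default_rng(0)
phi_dim = 4                          # feature dimension (assumed)
theta = np.zeros(phi_dim)            # mean estimate of the value parameters
P = np.eye(phi_dim)                  # parameter covariance (uncertainty over theta)
R = 1.0                              # observation noise variance of the noisy return (assumed)
Q = 1e-3 * np.eye(phi_dim)           # process noise, keeps the parameters adaptable (assumed)

def ekf_value_update(theta, P, phi, target):
    """One EKF step treating the observed return `target` as a noisy
    measurement of the value phi^T theta."""
    P_pred = P + Q                               # predicted covariance
    innovation = target - phi @ theta            # TD-like residual
    S = phi @ P_pred @ phi + R                   # innovation variance (scalar)
    K = P_pred @ phi / S                         # Kalman gain
    theta = theta + K * innovation               # posterior mean update
    P = P_pred - np.outer(K, phi) @ P_pred       # posterior covariance update
    return theta, P

# Toy usage: recover a fixed "true" parameter vector from noisy value observations.
true_theta = rng.normal(size=phi_dim)
for _ in range(200):
    phi = rng.normal(size=phi_dim)
    target = phi @ true_theta + rng.normal(scale=np.sqrt(R))
    theta, P = ekf_value_update(theta, P, phi, target)
print("estimated theta:", np.round(theta, 2))

In the deep RL setting described in the abstract, the value function is a neural network rather than a fixed linear model, so an extended-Kalman-filter approach would linearize it around the current parameters via its Jacobian instead of using fixed features as above.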