Paper Title
Blending MPC & Value Function Approximation for Efficient Reinforcement Learning
Paper Authors
Paper Abstract
Model-Predictive Control (MPC) is a powerful tool for controlling complex, real-world systems; it uses a model to make predictions about future behavior. For each state encountered, MPC solves an online optimization problem to choose a control action that will minimize future cost. This is a surprisingly effective strategy, but real-time performance requirements warrant the use of simple models. If the model is not sufficiently accurate, then the resulting controller can be biased, limiting performance. We present a framework for improving on MPC with model-free reinforcement learning (RL). The key insight is to view MPC as constructing a series of local Q-function approximations. We show that by using a parameter $λ$, similar to the trace decay parameter in TD($λ$), we can systematically trade off learned value estimates against the local Q-function approximations. We present a theoretical analysis that shows how error from inaccurate models in MPC and from value function estimation in RL can be balanced. We further propose an algorithm that changes $λ$ over time to reduce the dependence on MPC as our estimate of the value function improves, and test the efficacy of our approach on challenging high-dimensional manipulation tasks with biased models in simulation. We demonstrate that our approach can obtain performance comparable to that of MPC with access to the true dynamics, even under severe model bias, and is more sample-efficient than model-free RL.
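The λ-blending the abstract describes, combining h-step model-based returns with a learned terminal value estimate, can be illustrated with a short sketch. This is a minimal illustration of the idea under stated assumptions, not the authors' implementation; `model_step`, `cost_fn`, and `value_fn` are hypothetical placeholders for a (possibly biased) dynamics model, a per-step cost, and a learned value function.

```python
import numpy as np

def blended_rollout_cost(s0, actions, model_step, cost_fn, value_fn,
                         gamma=0.99, lam=0.9):
    """Lambda-blended H-step cost estimate for one candidate action sequence.

    Q_h   = sum_{t<h} gamma^t c(s_t, a_t) + gamma^h V(s_h)
    Q_lam = (1 - lam) * sum_{h=1}^{H-1} lam^(h-1) Q_h + lam^(H-1) Q_H

    (Sketch only: model_step, cost_fn, value_fn are assumed interfaces.)
    """
    H = len(actions)
    s = s0
    running_cost = 0.0
    q_estimates = []
    for h, a in enumerate(actions, start=1):
        running_cost += gamma ** (h - 1) * cost_fn(s, a)  # model-based stage cost
        s = model_step(s, a)                              # rollout under the (possibly biased) model
        q_estimates.append(running_cost + gamma ** h * value_fn(s))
    # lam -> 0 trusts the learned value function (one-step lookahead);
    # lam -> 1 recovers the full-horizon MPC estimate with a terminal value.
    weights = [(1 - lam) * lam ** (h - 1) for h in range(1, H)] + [lam ** (H - 1)]
    return float(np.dot(weights, q_estimates))
```

In an MPC loop, a planner would evaluate this score for many sampled action sequences and execute the first action of the best one; annealing `lam` toward 0 over training shifts trust from the biased model to the improving value estimate, in the spirit of the schedule described in the abstract.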