Compositional Reinforcement Learning for Discrete-Time Stochastic Control Systems

Paper Title

Compositional Reinforcement Learning for Discrete-Time Stochastic Control Systems

Authors

Abolfazl Lavaei, Mateo Perez, Milad Kazemi, Fabio Somenzi, Sadegh Soudjani, Ashutosh Trivedi, Majid Zamani

Abstract

We propose a compositional approach to synthesize policies for networks of continuous-space stochastic control systems with unknown dynamics using model-free reinforcement learning (RL). The approach is based on implicitly abstracting each subsystem in the network with a finite Markov decision process with unknown transition probabilities, synthesizing a strategy for each abstract model in an assume-guarantee fashion using RL, and then mapping the results back over the original network with approximate optimality guarantees. We provide lower bounds on the satisfaction probability of the overall network based on those over individual subsystems. A key contribution is to leverage the convergence results for adversarial RL (minimax Q-learning) on finite stochastic arenas to provide control strategies maximizing the probability of satisfaction over the network of continuous-space systems. We consider finite-horizon properties expressed in the syntactically co-safe fragment of linear temporal logic. These properties can readily be converted into automata-based reward functions, providing scalar reward signals suitable for RL. Since such reward functions are often sparse, we supply a potential-based reward shaping technique to accelerate learning by producing dense rewards. The effectiveness of the proposed approaches is demonstrated via two physical benchmarks including regulation of a room temperature network and control of a road traffic network.
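
The abstract mentions automata-based reward functions and potential-based reward shaping without giving details. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a hypothetical three-state automaton for the co-safe property "avoid unsafe until goal", a hypothetical one-dimensional position space with the goal at position 4, and a hand-picked potential (negative distance to the goal). The shaping term itself, gamma * phi(s') - phi(s), is the standard potential-based form.

```python
# Sketch only: the automaton, the 1-D position space, and the potential
# below are illustrative assumptions, not taken from the paper.
GOAL_POS = 4          # hypothetical goal position
ACCEPT, TRAP = 1, 2   # automaton states; 0 is the initial state


def label_of(pos):
    """Map a position to the label the automaton reads."""
    return "goal" if pos == GOAL_POS else "none"


def dfa_step(q, label):
    """Advance the automaton for 'avoid unsafe until goal' by one label."""
    if q != 0:
        return q  # accepting and trap states are absorbing
    if label == "goal":
        return ACCEPT
    if label == "unsafe":
        return TRAP
    return 0


def sparse_reward(q, q_next):
    """Reward 1 only on the transition that first reaches acceptance."""
    return 1.0 if q != ACCEPT and q_next == ACCEPT else 0.0


def potential(pos, q):
    """Assumed potential: negative distance to the goal while undecided."""
    return -float(abs(GOAL_POS - pos)) if q == 0 else 0.0


def rollout_rewards(positions, gamma):
    """Return per-step (sparse, shaped) rewards along a position sequence."""
    q = 0
    out = []
    for pos, nxt in zip(positions, positions[1:]):
        q_next = dfa_step(q, label_of(nxt))
        r = sparse_reward(q, q_next)
        # Potential-based shaping: r + gamma * phi(s') - phi(s)
        shaped = r + gamma * potential(nxt, q_next) - potential(pos, q)
        out.append((r, shaped))
        q = q_next
    return out


if __name__ == "__main__":
    for r, shaped in rollout_rewards([0, 1, 2, 3, 4], gamma=0.9):
        print(r, shaped)
```

Along this trajectory the sparse reward is zero until the final step, while every shaped reward is nonzero. Because the shaping terms telescope, the discounted shaped return differs from the discounted sparse return only by gamma^T * phi(s_T) - phi(s_0), a quantity that does not depend on the actions taken in between.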
