Paper Title
Efficient Reinforcement Learning from Demonstration Using Local Ensemble and Reparameterization with Split and Merge of Expert Policies
Paper Authors
Paper Abstract
Current work on reinforcement learning (RL) from demonstration often assumes that the demonstrations are samples from an optimal policy, an unrealistic assumption in practice. When demonstrations are generated by sub-optimal policies or have sparse state-action pairs, a policy learned from such demonstrations may mislead the agent with incorrect or non-local action decisions. We propose a new method called Local Ensemble and Reparameterization with Split and Merge of Expert Policies (LEARN-SAM) to improve efficiency and make better use of sub-optimal demonstrations. First, LEARN-SAM employs a new concept, the lambda-function, based on a discrepancy measure between the current state and the demonstrated states, to "localize" the weights of the expert policies during learning. Second, LEARN-SAM employs a split-and-merge (SAM) mechanism that separates the helpful parts of each expert demonstration and regroups them into new expert policies, so that the demonstrations are used selectively. Both the lambda-function and the SAM mechanism help boost the learning speed. Theoretically, we prove the invariance of the reparameterized policy before and after the SAM mechanism, providing theoretical guarantees for the convergence of the employed policy gradient method. We demonstrate the superiority of LEARN-SAM and its robustness to varying demonstration quality and sparsity in six experiments on complex continuous control problems of low to high dimensions, compared to existing RL-from-demonstration methods.
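To make the "local ensemble" idea concrete, the sketch below illustrates how expert demonstrations could be weighted by their proximity to the agent's current state, so that only locally relevant experts influence the action. The exponential kernel, the nearest-state discrepancy measure, the `bandwidth` parameter, and the blending rule are illustrative assumptions for this sketch, not the paper's exact lambda-function or policy-mixing scheme.

```python
# Hedged sketch: localize expert-policy weights by state discrepancy
# (assumed exponential kernel over distance to the nearest demonstrated state).
import numpy as np

def lambda_weights(current_state, expert_states, bandwidth=1.0):
    """Per-expert localization weights from a state-discrepancy measure.

    expert_states: list of arrays, one (T_i, state_dim) array per expert.
    Returns one weight in (0, 1] per expert, largest for experts whose
    demonstrated states lie closest to the current state.
    """
    weights = []
    for states in expert_states:
        # Discrepancy: distance from the current state to the nearest
        # demonstrated state of this expert (an assumed choice of measure).
        d = np.min(np.linalg.norm(states - current_state, axis=1))
        weights.append(np.exp(-d / bandwidth))
    return np.asarray(weights)

def blended_action(agent_action, expert_actions, weights):
    """Mix the agent's action with locally weighted expert actions.

    expert_actions: (n_experts, action_dim) actions recommended by the
    experts near the current state; weights: output of lambda_weights.
    """
    total = weights.sum()
    if total < 1e-8:          # no expert is locally relevant
        return agent_action
    expert_mix = (weights[:, None] * expert_actions).sum(axis=0) / total
    alpha = min(1.0, total)   # assumed trade-off between agent and experts
    return (1.0 - alpha) * agent_action + alpha * expert_mix

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expert_states = [rng.normal(size=(50, 4)), rng.normal(loc=3.0, size=(50, 4))]
    expert_actions = np.array([[0.5, -0.2], [-0.1, 0.8]])
    s = np.zeros(4)
    w = lambda_weights(s, expert_states, bandwidth=0.5)
    print("lambda weights:", w)
    print("blended action:", blended_action(np.array([0.0, 0.0]), expert_actions, w))
```

In this sketch, an expert whose demonstration never visits states near the current one receives a near-zero weight, which mirrors the abstract's goal of preventing sub-optimal or sparse demonstrations from steering the agent with non-local action decisions.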