Paper Title
Switch Trajectory Transformer with Distributional Value Approximation for Multi-Task Reinforcement Learning
Paper Authors
Paper Abstract
We propose SwitchTT, a multi-task extension of the Trajectory Transformer enhanced with two striking features: (i) exploiting a sparsely activated model to reduce the computation cost of multi-task offline model learning, and (ii) adopting a distributional trajectory value estimator that improves policy performance, especially in sparse-reward settings. These two enhancements make SwitchTT suitable for solving multi-task offline reinforcement learning problems, where model capacity is critical for absorbing the vast quantities of knowledge available in the multi-task dataset. More specifically, SwitchTT exploits the Switch Transformer model architecture for multi-task policy learning, allowing us to increase model capacity without a proportional increase in computation cost. In addition, SwitchTT approximates the distribution rather than the expectation of the trajectory value, mitigating the poor sample complexity of the Monte-Carlo value estimator, especially in the sparse-reward setting. We evaluate our method on a suite of ten sparse-reward tasks from the gym-minigrid environment. We show a 10% improvement over the Trajectory Transformer across ten-task learning and obtain up to a 90% increase in offline model training speed. Our results also demonstrate the advantage of the Switch Transformer model for absorbing expert knowledge and the importance of the value distribution in evaluating trajectories.
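The two enhancements described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch sketch (not the authors' code) of (i) a top-1 "switch" routing layer that replaces a dense transformer feed-forward block with sparsely activated experts, and (ii) a categorical value head that predicts a distribution over discretised trajectory returns rather than a single expected value. All class names, hyperparameters, and the C51-style support are assumptions made for illustration only.

```python
# Illustrative sketch only (assumed PyTorch implementation, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchFeedForward(nn.Module):
    """Top-1 routed mixture-of-experts feed-forward block.

    Each token is sent to exactly one expert, so per-token compute stays roughly
    constant while total parameter count grows with the number of experts.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for routing
        tokens = x.reshape(-1, x.shape[-1])
        gate_probs = F.softmax(self.router(tokens), dim=-1)   # (tokens, experts)
        top_prob, top_idx = gate_probs.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by the gate probability so the router remains differentiable.
                out[mask] = expert(tokens[mask]) * top_prob[mask].unsqueeze(-1)
        return out.reshape_as(x)


class DistributionalValueHead(nn.Module):
    """Categorical (C51-style) head over a fixed support of discretised returns,
    approximating the distribution of the trajectory value instead of its mean."""

    def __init__(self, d_model: int, num_atoms: int = 51,
                 v_min: float = 0.0, v_max: float = 1.0):
        super().__init__()
        self.logits = nn.Linear(d_model, num_atoms)
        self.register_buffer("support", torch.linspace(v_min, v_max, num_atoms))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        probs = F.softmax(self.logits(h), dim=-1)  # probability mass per return atom
        return probs  # expected value, if needed: (probs * self.support).sum(-1)
```

Under these assumptions, the distributional head could score candidate trajectories during planning by their expected return or by a more conservative quantile of the predicted distribution, which is one plausible reading of how a value distribution helps in sparse-reward settings.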