解决GPU加速超级计算机上的大型置换流程度调度问题

论文标题

解决GPU加速超级计算机上的大型置换流程度调度问题

Solving large permutation flow-shop scheduling problems on GPU-accelerated supercomputers

论文作者

Gmys, Jan

论文摘要

使置换流程调度的置换量最小化是众所周知的硬组合优化问题。在E. Taillard在1993年提出的120个标准基准实例中，近三十年来一直未解决23个。在本文中，我们介绍了在GPU加速的Jean Zay SuperComputer上使用并行分支和结合的树搜索来解决这些实例以最佳的尝试。我们报告了11个先前未解决的问题实例的精确解决方案，并改善了8个实例的上限。这些问题的解决方案既需要算法改进，也需要利用PETA级高性能计算平台的计算能力。挑战包括有效地执行由数百个由数百个大量平行的加速器设备和多核处理器组成的分布式系统上的高度不规则，细粒度搜索树的平行深度优先遍历。我们介绍并讨论了基于置换的B＆B的设计和实施，并实验评估了其在高达384 V100 GPU（200万CUDA内核）和3840 CPU内核上的并行性能。最大解决实例的最佳证明需要约64个CPU年的计算 - 使用256 GPU和超过400万平行搜索剂，搜索树的遍历在13小时内完成，探索339 $ \ times 10^{12} $ nodes。

Makespan minimization in permutation flow-shop scheduling is a well-known hard combinatorial optimization problem. Among the 120 standard benchmark instances proposed by E. Taillard in 1993, 23 have remained unsolved for almost three decades. In this paper, we present our attempts to solve these instances to optimality using parallel Branch-and-Bound tree search on the GPU-accelerated Jean Zay supercomputer. We report the exact solution of 11 previously unsolved problem instances and improved upper bounds for 8 instances. The solution of these problems requires both algorithmic improvements and leveraging the computing power of peta-scale high-performance computing platforms. The challenge consists in efficiently performing parallel depth-first traversal of a highly irregular, fine-grained search tree on distributed systems composed of hundreds of massively parallel accelerator devices and multi-core processors. We present and discuss the design and implementation of our permutation-based B&B and experimentally evaluate its parallel performance on up to 384 V100 GPUs (2 million CUDA cores) and 3840 CPU cores. The optimality proof for the largest solved instance requires about 64 CPU-years of computation -- using 256 GPUs and over 4 million parallel search agents, the traversal of the search tree is completed in 13 hours, exploring 339$\times 10^{12}$ nodes.

下载PDF全文

下载文献需遵守相关版权规定

论文标题