An Improved Framework of GPU Computing for CFD Applications on Structured Grids using OpenACC

Paper Title

An Improved Framework of GPU Computing for CFD Applications on Structured Grids using OpenACC

Authors

Xue, Weicheng; Jackson, Charles W.; Roy, Christopher J.

Abstract

This paper focuses on improving the multi-GPU performance of a research CFD code on structured grids. MPI and OpenACC directives are used to scale the code up to 16 GPUs. The paper shows that 16 P100 GPUs and 16 V100 GPUs can be 30$\times$ and 70$\times$ faster, respectively, than 16 Xeon E5-2680v4 CPU cores across three different test cases. A series of performance issues related to scaling the multi-block CFD code are addressed by applying various optimizations. These include the pack/unpack message method, removing temporary arrays as arguments to procedure calls, allocating global memory for limiters and connected boundary data, reordering non-blocking MPI\_Isend/MPI\_Irecv and MPI\_Wait calls, reducing unnecessary implicit movement of derived type member data between the host and the device, and the use of GPUDirect. Together, these optimizations can improve the compute utilization, memory throughput, and asynchronous progression in the multi-block CFD code using modern programming features.
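As a rough illustration of the pack/unpack message method named in the abstract, the sketch below gathers a strided boundary face of a structured block into one contiguous buffer and scatters it into a neighbor's ghost layer. It is a minimal pure-Python sketch, not the paper's implementation: the flat row-major layout, the one-cell ghost layer, and the names `make_block`, `idx`, `pack_k_face`, and `unpack_k_face` are all assumptions made for illustration. In an MPI code the packed buffer would travel as a single message; here it is handed over directly.

```python
# Hypothetical sketch of pack/unpack for a structured block.
# All names and the data layout are illustrative assumptions.

def idx(i, j, k, nj, nk):
    """Flat row-major index for a block with k varying fastest."""
    return (i * nj + j) * nk + k

def make_block(ni, nj, nk, fill):
    """Build flat storage for an ni x nj x nk block from fill(i, j, k)."""
    return [fill(i, j, k)
            for i in range(ni) for j in range(nj) for k in range(nk)]

def pack_k_face(block, k, ni, nj, nk):
    """Gather the face at fixed k (stride nk in memory) into a contiguous list."""
    return [block[idx(i, j, k, nj, nk)]
            for i in range(ni) for j in range(nj)]

def unpack_k_face(buffer, block, k, ni, nj, nk):
    """Scatter a contiguous buffer back into the face at fixed k."""
    pos = 0
    for i in range(ni):
        for j in range(nj):
            block[idx(i, j, k, nj, nk)] = buffer[pos]
            pos += 1

ni, nj, nk = 2, 3, 4

# Sender block: each value encodes its own (i, j, k) for easy inspection.
sender = make_block(ni, nj, nk, lambda i, j, k: 100 * i + 10 * j + k)
# Receiver block: all zeros, with k = 0 treated as its ghost layer.
receiver = make_block(ni, nj, nk, lambda i, j, k: 0)

# Pack the sender's last k-face into one contiguous message buffer.
buffer = pack_k_face(sender, nk - 1, ni, nj, nk)

# Unpack the buffer into the receiver's ghost layer at k = 0.
unpack_k_face(buffer, receiver, 0, ni, nj, nk)

print(buffer)
```

Packing turns many small strided accesses into one contiguous transfer, which is the general motivation for this kind of method; the specific gains reported in the abstract come from the paper's own code and are not reproduced by this sketch.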
