GPA: A GPU Performance Advisor Based on Instruction Sampling

Paper Title

GPA: A GPU Performance Advisor Based on Instruction Sampling

Authors

Keren Zhou, Xiaozhu Meng, Ryuichi Sai, John Mellor-Crummey

Abstract

Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained suggestions at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimization opportunities at a hierarchy of levels, including individual lines, loops, and functions. To relieve users of the burden of interpreting performance counters and analyzing bottlenecks, GPA uses data flow analysis to approximately attribute measured instruction stalls to their root causes and uses information about a program's structure and the GPU to match inefficiency patterns with suggestions for optimization. To quantify each suggestion's potential benefits, we developed PC sampling-based performance models to estimate its speedup. Our experiments with benchmarks and applications show that GPA provides an insightful report to guide performance optimization. Using GPA, we obtained speedups on a Volta V100 GPU ranging from 1.01$\times$ to 3.53$\times$, with a geometric mean of 1.22$\times$.
