Paper Title
ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs
Paper Authors
Paper Abstract
Transformers have become keystone models in natural language processing over the past decade. They have achieved great popularity in deep learning applications, but the increasing sizes of the parameter spaces required by transformer models generate a commensurate need to accelerate performance. Natural language processing problems are also routinely faced with variable-length sequences, as word counts commonly vary among sentences. Existing deep learning frameworks pad variable-length sequences to a maximal length, which adds significant memory and computational overhead. In this paper, we present ByteTransformer, a high-performance transformer boosted for variable-length inputs. We propose a padding-free algorithm that liberates the entire transformer from redundant computations on zero padded tokens. In addition to algorithmic-level optimization, we provide architecture-aware optimizations for transformer functional modules, especially the performance-critical algorithm Multi-Head Attention (MHA). Experimental results on an NVIDIA A100 GPU with variable-length sequence inputs validate that our fused MHA outperforms the standard PyTorch MHA by 6.13x. The end-to-end performance of ByteTransformer for a forward BERT transformer surpasses state-of-the-art transformer frameworks, such as PyTorch JIT, TensorFlow XLA, Tencent TurboTransformer, Microsoft DeepSpeed-Inference and NVIDIA FasterTransformer, by 87%, 131%, 138%, 74% and 55%, respectively. We also demonstrate the general applicability of our optimization methods to other BERT-like models, including ALBERT, DistilBERT, and DeBERTa.
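The padding-free idea described in the abstract can be illustrated with a small PyTorch sketch. This is only a conceptual illustration, not ByteTransformer's actual CUDA implementation: valid tokens from a padded batch are packed into one contiguous buffer, and a prefix sum of the sequence lengths records where each sequence begins, so later layers never touch the zero-padded positions. The function name `pack_variable_length` and its arguments are illustrative, not taken from the paper.

```python
# Minimal sketch of padding-free packing for variable-length sequences.
# Assumption: the real system performs this (and the reverse re-padding)
# inside fused GPU kernels; here we only show the data layout idea.
import torch

def pack_variable_length(padded, lengths):
    """padded: [batch, max_len, hidden]; lengths: [batch] counts of real tokens."""
    batch, max_len, hidden = padded.shape
    # Boolean mask marking the real (non-padding) positions of each sequence.
    mask = torch.arange(max_len).unsqueeze(0) < lengths.unsqueeze(1)
    # Gather only the valid tokens into one contiguous buffer:
    # shape [total_valid_tokens, hidden].
    packed = padded[mask]
    # Exclusive prefix sum of the lengths gives the offset of each
    # sequence inside the packed buffer.
    offsets = torch.cat([torch.zeros(1, dtype=lengths.dtype),
                         torch.cumsum(lengths, dim=0)])
    return packed, offsets

# Example: a batch of 3 sequences with 2, 5 and 3 real tokens, hidden size 4.
lengths = torch.tensor([2, 5, 3])
padded = torch.randn(3, 5, 4)
packed, offsets = pack_variable_length(padded, lengths)
print(packed.shape)  # torch.Size([10, 4]) -- 10 real tokens instead of 15 padded ones
print(offsets)       # tensor([ 0,  2,  7, 10])
```

In this toy batch, packing reduces the token count from 15 (3 x 5 padded) to 10, which mirrors how the padding-free algorithm avoids redundant computation on zero-padded tokens throughout the transformer.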