Paper Title

Vector-Vector-Matrix Architecture: A Novel Hardware-Aware Framework for Low-Latency Inference in NLP Applications

Paper Authors

Matthew Khoury, Rumen Dangovski, Longwu Ou, Preslav Nakov, Yichen Shen, Li Jing

Paper Abstract

Deep neural networks have become the standard approach to building reliable Natural Language Processing (NLP) applications, ranging from Neural Machine Translation (NMT) to dialogue systems. However, improving accuracy by increasing the model size requires a large number of hardware computations, which can slow down NLP applications significantly at inference time. To address this issue, we propose a novel vector-vector-matrix architecture (VVMA), which greatly reduces the latency at inference time for NMT. This architecture takes advantage of specialized hardware that has low-latency vector-vector operations and higher-latency vector-matrix operations. It also reduces the number of parameters and FLOPs for virtually all models that rely on efficient matrix multipliers without significantly impacting accuracy. We present empirical results suggesting that our framework can reduce the latency of sequence-to-sequence and Transformer models used for NMT by a factor of four. Finally, we show evidence suggesting that our VVMA extends to other domains, and we discuss novel hardware for its efficient use.
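The abstract only sketches the mechanism, so the snippet below is a minimal NumPy illustration of the vector-vector-matrix pattern it describes: one shared matrix stays loaded in the higher-latency matrix-multiply unit, while cheap per-layer elementwise (vector-vector) operations adapt it. The specific factorization diag(v) · M · diag(u), the hidden size, and all variable names are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512  # hypothetical hidden size

# One shared matrix M, loaded once into the (higher-latency) matrix unit.
M = rng.standard_normal((n, n)) / np.sqrt(n)

# Per-layer adapter vectors, applied with (low-latency) vector-vector ops.
u = rng.standard_normal(n)
v = rng.standard_normal(n)

x = rng.standard_normal(n)

# VVMA-style layer: two elementwise (vector-vector) products around
# a single shared vector-matrix product.
y = v * (M @ (u * x))

# The dense weight matrix this factorization stands in for:
W_equiv = np.diag(v) @ M @ np.diag(u)
assert np.allclose(y, W_equiv @ x)

# Per-layer parameter cost drops from n*n (a full, layer-specific W)
# to 2*n (just u and v), since M is shared across layers.
print("dense params per layer:", n * n)  # 262144
print("VVMA params per layer :", 2 * n)  # 1024
```

Under these assumptions, the parameter and FLOP savings the abstract claims come from amortizing the single expensive matrix across layers while each layer pays only for two length-n elementwise products.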
