Paper Title
Continual Transformers: Redundancy-Free Attention for Online Inference
Paper Authors
Paper Abstract
Transformers in their common form are inherently limited to operate on whole token sequences rather than on one token at a time. Consequently, their use during online inference on time-series data entails considerable redundancy due to the overlap in successive token sequences. In this work, we propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference on a continual input stream. Importantly, our modifications are purely to the order of computations, while the outputs and learned weights are identical to those of the original Transformer Encoder. We validate our Continual Transformer Encoder with experiments on the THUMOS14, TVSeries and GTZAN datasets with remarkable results: Our Continual one- and two-block architectures reduce the floating point operations per prediction by up to 63x and 2.6x, respectively, while retaining predictive performance.
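To make concrete where the redundancy during online inference comes from, the following is a minimal NumPy sketch of the standard Scaled Dot-Product Attention applied to overlapping sliding windows of a token stream. It is not the paper's Continual formulation; the window length, dimensions, and identity Q/K/V projections are illustrative assumptions only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard Scaled Dot-Product Attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n, n) attention logits
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # (n, d) attended outputs

# Online inference over a sliding window of n tokens: a regular Transformer
# recomputes attention for the entire window each time one new token arrives,
# even though n - 1 tokens overlap with the previous step.
rng = np.random.default_rng(0)
n, d = 64, 32                                      # window length, embedding dim (assumed)
stream = rng.standard_normal((n + 10, d))          # toy token stream

for t in range(n, stream.shape[0]):
    window = stream[t - n:t]                       # overlapping token window
    Q = K = V = window                             # identity projections for brevity
    out = scaled_dot_product_attention(Q, K, V)
    prediction = out[-1]                           # only the newest token's output is new
```

At each step only the newest token's output is actually needed, yet the full n-by-n attention is recomputed. This recomputation over overlapping windows is the redundancy that the Continual formulations remove by reordering the computation while, per the abstract, leaving outputs and learned weights identical to the original Transformer Encoder.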