Paper Title

Fcaformer: Forward Cross Attention in Hybrid Vision Transformer

Paper Authors

Haokui Zhang, Wenze Hu, Xiaoyu Wang

Paper Abstract

Currently, one main research line in designing a more efficient vision transformer is reducing the computational cost of self-attention modules by adopting sparse attention or using local attention windows. In contrast, we propose a different approach that aims to improve the performance of transformer-based architectures by densifying the attention pattern. Specifically, we propose forward cross attention for hybrid vision transformer (FcaFormer), where tokens from previous blocks in the same stage are used a second time. To achieve this, FcaFormer leverages two innovative components: learnable scale factors (LSFs) and a token merge and enhancement module (TME). The LSFs enable efficient processing of cross tokens, while the TME generates representative cross tokens. By integrating these components, the proposed FcaFormer enhances the interactions of tokens across blocks with potentially different semantics and encourages more information flow to the lower levels. Based on forward cross attention (Fca), we have designed a series of FcaFormer models that achieve the best trade-off among model size, computational cost, memory cost, and accuracy. For example, without the need for knowledge distillation to strengthen training, our FcaFormer achieves 83.1% top-1 accuracy on ImageNet with only 16.3 million parameters and about 3.6 billion MACs. This saves almost half of the parameters and some of the computational cost while achieving 0.7% higher accuracy compared to the distilled EfficientFormer.
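
Since the abstract describes the mechanism only at a high level, the following is a minimal PyTorch-style sketch of how forward cross attention with learnable scale factors and merged cross tokens could be wired up. The class and parameter names (e.g. `merge_ratio`) and the simple average-pool token merge are illustrative assumptions, not the paper's exact LSF/TME implementation.

```python
import torch
import torch.nn as nn

class ForwardCrossAttention(nn.Module):
    """Sketch of forward cross attention: queries come from the current block's
    tokens, while the key/value set is extended with "cross tokens" forwarded
    from previous blocks in the same stage. The average-pool merge below is a
    stand-in for the paper's token merge and enhancement module (TME)."""

    def __init__(self, dim, num_heads=4, merge_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable scale factor (LSF) applied to cross tokens before attention.
        self.lsf = nn.Parameter(torch.ones(1, 1, dim))
        # Token merge: reduce previous-block tokens to a compact, representative set.
        self.merge = nn.AvgPool1d(kernel_size=merge_ratio, stride=merge_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, cross_tokens=None):
        # x: (B, N, C) tokens of the current block
        # cross_tokens: (B, M, C) tokens saved from earlier blocks, or None
        q = self.norm(x)
        if cross_tokens is not None:
            # Merge cross tokens along the sequence dimension, then scale them.
            merged = self.merge(cross_tokens.transpose(1, 2)).transpose(1, 2)
            kv = torch.cat([q, self.lsf * merged], dim=1)
        else:
            kv = q
        out, _ = self.attn(q, kv, kv)
        return x + out  # residual connection

if __name__ == "__main__":
    blk = ForwardCrossAttention(dim=64)
    x = torch.randn(2, 196, 64)      # current block tokens
    prev = torch.randn(2, 196, 64)   # tokens forwarded from a previous block
    y = blk(x, cross_tokens=prev)
    print(y.shape)                   # torch.Size([2, 196, 64])
```

The design point this sketch illustrates is that the cross tokens only enlarge the key/value set, so the extra attention cost grows with the number of merged cross tokens rather than with the full sequence length.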
