Paper Title
Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning
Paper Authors
Paper Abstract
Vision transformers have recently achieved competitive results across various vision tasks, but they still suffer from heavy computation costs when processing a large number of tokens. Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers, especially for image classification tasks. Typically, these approaches select a small group of essential tokens according to their relevance to the class token and then fine-tune the weights of the vision transformer. Such fine-tuning is less practical for dense prediction because of its much heavier computation and GPU memory cost compared with image classification. In this paper, we focus on a more challenging problem: accelerating large-scale vision transformers for dense prediction without any additional re-training or fine-tuning. Because high-resolution representations are necessary for dense prediction, we present two non-parametric operators: a token clustering layer that decreases the number of tokens and a token reconstruction layer that increases the number of tokens back. The method proceeds in three steps: (i) we use the token clustering layer to cluster neighboring tokens together, producing low-resolution representations that maintain the spatial structure; (ii) we apply the subsequent transformer layers only to these low-resolution representations (clustered tokens); and (iii) we use the token reconstruction layer to re-create high-resolution representations from the refined low-resolution representations. Our method achieves promising results on five dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation.
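The three-step pipeline in the abstract can be illustrated with a minimal PyTorch sketch. This is not the paper's actual implementation: as simplifying assumptions, clustering is approximated here by spatial average pooling over the token grid (one non-parametric way to merge neighboring tokens while keeping spatial structure), and reconstruction is approximated by a similarity-weighted combination of the refined low-resolution tokens, with weights computed between the original high-resolution tokens and the pre-refinement clustered tokens. The function names, shapes, and the temperature `tau` are all illustrative choices, not from the paper.

```python
import torch
import torch.nn.functional as F

def token_clustering(tokens, h, w, sh, sw):
    """Step (i) sketch: merge neighboring tokens into a low-resolution grid.

    tokens: (B, h*w, C) high-resolution token sequence laid out as an h x w grid.
    Returns: (B, sh*sw, C) clustered (low-resolution) tokens.
    Here we simply average-pool spatially; the paper's operator is more
    sophisticated, but this preserves the same shapes and spatial structure.
    """
    B, N, C = tokens.shape
    x = tokens.transpose(1, 2).reshape(B, C, h, w)   # to (B, C, h, w) grid
    x = F.adaptive_avg_pool2d(x, (sh, sw))           # pool neighbors together
    return x.flatten(2).transpose(1, 2)              # back to (B, sh*sw, C)

def token_reconstruction(hi_tokens, lo_before, lo_after, tau=0.05):
    """Step (iii) sketch: re-create high-resolution tokens from refined
    low-resolution ones.

    hi_tokens: (B, N, C) original high-resolution tokens (before clustering).
    lo_before: (B, M, C) clustered tokens before the transformer layers.
    lo_after:  (B, M, C) clustered tokens after refinement (step (ii)).
    Each high-res token is rebuilt as a softmax-weighted mixture of the
    refined clusters, weighted by its cosine similarity to the clusters
    it was merged into; `tau` is an assumed softmax temperature.
    """
    q = F.normalize(hi_tokens, dim=-1)               # (B, N, C)
    k = F.normalize(lo_before, dim=-1)               # (B, M, C)
    weights = torch.softmax(q @ k.transpose(1, 2) / tau, dim=-1)  # (B, N, M)
    return weights @ lo_after                        # (B, N, C)
```

In step (ii), `lo_after` would come from running the remaining (unmodified, not fine-tuned) transformer layers on the `sh*sw` clustered tokens instead of all `h*w` tokens, which is where the computational savings arise; both operators above are parameter-free, so no re-training is needed.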