Paper Title


Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers

Authors

Yunjie Tian, Lingxi Xie, Jiemin Fang, Mengnan Shi, Junran Peng, Xiaopeng Zhang, Jianbin Jiao, Qi Tian, Qixiang Ye

Abstract


The past year has witnessed a rapid development of masked image modeling (MIM). MIM is mostly built upon vision transformers, and it suggests that self-supervised visual representation learning can be accomplished by masking parts of the input image while requiring the target model to recover the missing contents. MIM has demonstrated promising results on downstream tasks, yet we are interested in whether there exist other effective ways to "learn by recovering missing contents". In this paper, we investigate this topic by designing five other learning objectives that follow the same procedure as MIM but degrade the input image in different ways. With extensive experiments, we manage to summarize a few design principles for token-based pre-training of vision transformers. In particular, the best practice is obtained by keeping the original image style and enriching spatial masking with spatial misalignment -- this design achieves superior performance over MIM on a series of downstream recognition tasks without extra computational cost. The code is available at https://github.com/sunsmarterjie/beyond_masking.
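The common procedure the abstract describes — tokenize an image into patches, degrade some tokens, and ask the model to recover them — can be illustrated with a minimal sketch. Note this is not the authors' implementation: the function names, the zeroing used for masking, and the random permutation standing in for "spatial misalignment" are all illustrative assumptions; see the linked repository for the actual objectives.

```python
import numpy as np

def to_patches(img, p):
    # Split an HxWxC image into non-overlapping pxp patch tokens.
    h, w, c = img.shape
    patches = img.reshape(h // p, p, w // p, p, c).swapaxes(1, 2)
    return patches.reshape(-1, p, p, c)  # (num_tokens, p, p, C)

def mask_tokens(patches, ratio=0.75, rng=None):
    # MIM-style degradation: zero out a random subset of patch tokens
    # (real implementations substitute a learnable mask token instead).
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(patches)
    idx = rng.choice(n, size=int(n * ratio), replace=False)
    degraded = patches.copy()
    degraded[idx] = 0.0
    return degraded, idx

def misalign_tokens(patches, rng=None):
    # Illustrative stand-in for "spatial misalignment": the image style
    # (pixel content) is preserved, but token positions are permuted,
    # so the model must recover the correct spatial arrangement.
    rng = rng if rng is not None else np.random.default_rng(0)
    perm = rng.permutation(len(patches))
    return patches[perm], perm
```

In both cases the pre-training target is the original, undegraded token sequence, which is why such objectives add no extra computational cost over plain MIM.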
