Paper Title
Improvements to Self-Supervised Representation Learning for Masked Image Modeling
Paper Authors
Paper Abstract
This paper explores improvements to the masked image modeling (MIM) paradigm. The MIM paradigm enables a model to learn the main object features of an image by masking the input image and predicting the masked portion from the unmasked portion. We identify three main directions in which MIM can be improved. First, although both the encoder and the decoder contribute to representation learning, MIM uses only the encoder for downstream tasks, ignoring the decoder's impact on representation learning. Although the MIM paradigm already employs a small decoder in an asymmetric structure, we believe that further reducing the decoder's parameters improves the representation learning capability of the encoder. Second, MIM solves the image prediction task by training the encoder and decoder together, without designing a separate task for the encoder. To further enhance the encoder's performance on downstream tasks, we design encoder-specific tasks of contrastive learning and token position prediction. Third, since the input image may contain the background and other objects, and the proportion of each object in the image varies, reconstructing tokens related to the background or to other objects is not meaningful for MIM's understanding of the main object representations. Therefore, we use ContrastiveCrop to crop the input image so that it contains, as far as possible, only the main objects. Based on the above three improvements to MIM, we propose a new model, Contrastive Masked AutoEncoders (CMAE). We achieve a Top-1 accuracy of 65.84% on TinyImageNet using the ViT-B backbone, outperforming the competing MAE method by +2.89 when all other conditions are equal. Code will be made available.
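To make the three improvements concrete, below is a minimal, self-contained PyTorch sketch of the ideas described in the abstract. It is an illustration only, not the authors' implementation: all names (ToyCMAE, semantic_crop, the head modules) and hyperparameters are hypothetical, the backbone is a generic transformer encoder standing in for ViT-B, and the simplified crop function only gestures at ContrastiveCrop's semantic-aware localization rather than reproducing it.

```python
# Illustrative sketch only; every name and size here is hypothetical,
# not the paper's actual CMAE code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def semantic_crop(img, heatmap, thresh=0.1):
    """Simplified stand-in for ContrastiveCrop's semantic-aware localization:
    crop to the bounding box of high-activation heatmap regions, so the crop
    contains mostly the main object (improvement 3)."""
    ys, xs = torch.nonzero(heatmap > thresh, as_tuple=True)
    if len(ys) == 0:                      # nothing salient: keep full image
        return img
    return img[:, ys.min():ys.max() + 1, xs.min():xs.max() + 1]


class ToyCMAE(nn.Module):
    def __init__(self, num_patches=64, patch_dim=48, dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Generic transformer encoder, standing in for the ViT-B backbone.
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Improvement 1: a deliberately tiny decoder (here a two-layer MLP),
        # pushing representation learning into the encoder.
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                     nn.Linear(dim, patch_dim))
        # Improvement 2: encoder-only auxiliary heads.
        self.proj_head = nn.Linear(dim, 64)          # contrastive projection
        self.pos_head = nn.Linear(dim, num_patches)  # token position prediction

    def random_mask(self, x):
        b, n, d = x.shape
        keep = int(n * (1 - self.mask_ratio))
        keep_idx = torch.rand(b, n, device=x.device).argsort(1)[:, :keep]
        visible = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
        mask = torch.ones(b, n, device=x.device)     # 1 = masked, 0 = visible
        mask.scatter_(1, keep_idx, 0.0)
        return visible, keep_idx, mask

    def forward(self, patches):
        b, n, _ = patches.shape
        tokens = self.patch_embed(patches) + self.pos_embed
        visible, keep_idx, mask = self.random_mask(tokens)
        latent = self.encoder(visible)
        # Refill masked positions with a learned mask token, then decode.
        full = self.mask_token.expand(b, n, -1).clone()
        full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, latent.size(-1)),
                      latent)
        recon = self.decoder(full)
        # Reconstruction loss on masked patches only, as in MAE.
        recon_loss = (F.mse_loss(recon, patches, reduction="none").mean(-1)
                      * mask).sum() / mask.sum()
        # Auxiliary task: predict each visible token's original position.
        pos_loss = F.cross_entropy(self.pos_head(latent).flatten(0, 1),
                                   keep_idx.flatten())
        # Pooled embedding for a contrastive loss (e.g. InfoNCE) across views.
        emb = F.normalize(self.proj_head(latent.mean(1)), dim=-1)
        return recon_loss, pos_loss, emb


model = ToyCMAE()
patches = torch.randn(2, 64, 48)  # (batch, num_patches, flattened patch pixels)
recon_loss, pos_loss, emb = model(patches)
```

In this sketch the reconstruction loss is computed only over masked positions, the position-prediction head gives the encoder its own task independent of the decoder, and the returned embedding would feed a contrastive loss between two cropped views of the same image.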