Paper Title

Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

Paper Authors

Guglielmo Camporese, Elena Izzo, Lamberto Ballan

Paper Abstract

Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks, showing impressive performance when trained on big datasets. However, on relatively small datasets, ViTs are less accurate given their lack of inductive bias. To this end, we propose a simple yet effective Self-Supervised Learning (SSL) strategy to train ViTs that, without any external annotation or external data, can significantly improve the results. Specifically, we define a set of SSL tasks, based on relations between image patches, that the model has to solve before, or jointly with, the supervised task. Unlike ViT, our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step. We investigate our method on several image benchmarks, finding that RelViT improves on the SSL state of the art by a large margin, especially on small datasets. Code is available at: https://github.com/guglielmocamporese/relvit.
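To make the core idea concrete: one simple instance of a patch-relation SSL task is to shuffle an image's patches and ask the encoder to recover each patch's original grid position, so that every patch token (not just a class token) receives a supervision signal at each step. The sketch below is a minimal, hypothetical PyTorch illustration of that idea; the `PatchRelationSSL` module, its layer sizes, and the shuffle-and-recover task are assumptions for illustration, not the authors' exact set of tasks (see the linked repository for the real implementation).

```python
import torch
import torch.nn as nn

class PatchRelationSSL(nn.Module):
    """Illustrative patch-relation SSL objective (hypothetical sketch):
    shuffle the patch tokens of an image and train the encoder to predict
    each patch's original grid position. Every output token contributes a
    loss term, so all patch tokens are supervised at every training step."""

    def __init__(self, img_size=32, patch_size=4, dim=192, depth=6, heads=3):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Patchify + linearly embed with a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # One class per grid cell: "which original position does this patch come from?"
        self.position_head = nn.Linear(dim, self.num_patches)

    def forward(self, images):
        # (B, 3, H, W) -> (B, N, dim) patch tokens.
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)
        B, N, _ = tokens.shape
        # Independent random permutation of the patches of each image.
        perm = torch.stack([torch.randperm(N, device=images.device) for _ in range(B)])
        shuffled = torch.gather(
            tokens, 1, perm.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        )
        # Positional embeddings mark the *current* slot; the target is the
        # *original* position of the patch placed in that slot.
        out = self.encoder(shuffled + self.pos_embed)        # (B, N, dim)
        logits = self.position_head(out)                     # (B, N, N)
        return nn.functional.cross_entropy(
            logits.reshape(B * N, N), perm.reshape(B * N)
        )

# Usage: the SSL loss can be minimized alone or added to a supervised loss.
model = PatchRelationSSL()
loss = model(torch.randn(8, 3, 32, 32))
loss.backward()
```

Because the target is defined by the shuffle itself, this objective needs no external annotations, which is why such tasks can be used as pretraining or as an auxiliary loss alongside the supervised task.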
