Paper Title

Self-Supervised Ranking for Representation Learning

Paper Authors

Ali Varamesh, Ali Diba, Tinne Tuytelaars, Luc Van Gool

Abstract

We present a new framework for self-supervised representation learning by formulating it as a ranking problem in an image retrieval context over a large number of random views (augmentations) obtained from images. Our work is based on two intuitions: first, a good representation of images must yield a high-quality image ranking in a retrieval task; second, we would expect random views of an image to be ranked closer to a reference view of that image than random views of other images. Hence, we model representation learning as a learning-to-rank problem for image retrieval. We train a representation encoder by maximizing average precision (AP) for ranking, where random views of an image are considered positively related, and views of other images are considered negatives. The new framework, dubbed S2R2, enables computing a global objective over multiple views, compared to the local objective in the popular contrastive learning framework, which is calculated on pairs of views. In principle, by using a ranking criterion, we eliminate reliance on object-centric curated datasets. When trained on STL10 and MS-COCO, S2R2 outperforms SimCLR and the clustering-based contrastive learning model, SwAV, while being much simpler both conceptually and in implementation. On MS-COCO, S2R2 outperforms both SwAV and SimCLR by a larger margin than on STL10. This indicates that S2R2 is more effective on diverse scenes and could eliminate the need for a large object-centric training dataset for self-supervised representation learning.
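To make the ranking objective concrete, the sketch below computes retrieval average precision for one reference view: candidate embeddings are ranked by cosine similarity to the reference, and AP averages precision at the rank of each positive (a view of the same image). This is only an illustrative, non-differentiable AP computation under assumed conventions, not the paper's actual (differentiable) training loss; all function and variable names are hypothetical.

```python
import numpy as np

def average_precision(reference, candidates, labels):
    """Retrieval AP for one reference view (illustrative sketch).

    reference:  (d,) embedding of the reference view.
    candidates: (N, d) embeddings of candidate views.
    labels:     length-N list, 1 if a candidate is a random view of the
                same image as the reference (positive), else 0 (negative).
    """
    # Rank candidates by cosine similarity to the reference.
    ref = reference / np.linalg.norm(reference)
    cand = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = cand @ ref
    order = np.argsort(-scores)              # highest similarity first
    ranked = np.asarray(labels)[order]

    # AP = mean over positives of precision@k at each positive's rank k.
    hits = np.cumsum(ranked)
    precision_at_k = hits / (np.arange(len(ranked)) + 1)
    return float((precision_at_k * ranked).sum() / ranked.sum())
```

If the positive view ranks first, AP is 1.0; each positive pushed below a negative lowers AP, which is what a global objective over the whole ranked list (rather than a pairwise contrastive term) would penalize during training.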
