Title

Non-Contrastive Learning Meets Language-Image Pre-Training

Authors

Jinghao Zhou, Li Dong, Zhe Gan, Lijuan Wang, Furu Wei

Abstract

Contrastive language-image pre-training (CLIP) serves as a de-facto standard to align images and texts. Nonetheless, the loose correlation between images and texts of web-crawled data renders the contrastive objective data-inefficient and craving for a large training batch size. In this work, we explore the validity of non-contrastive language-image pre-training (nCLIP), and study whether the nice properties exhibited in visual self-supervised models can emerge. We empirically observe that the non-contrastive objective nourishes representation learning while significantly underperforming under zero-shot recognition. Based on the above study, we further introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics. The synergy between the two objectives lets xCLIP enjoy the best of both worlds: superior performance in both zero-shot transfer and representation learning. Systematic evaluation is conducted spanning a wide variety of downstream tasks including zero-shot classification, out-of-domain classification, retrieval, visual representation learning, and textual representation learning, showcasing a consistent performance gain and validating the effectiveness of xCLIP.
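The multi-task objective described in the abstract can be sketched as a weighted sum of the symmetric contrastive (InfoNCE) loss used by CLIP and a non-contrastive cross-entropy between per-modality assignment distributions. This is a minimal illustrative sketch, not the paper's implementation: the function names, the `alpha` weight, the temperature value, and the use of pre-computed assignment probabilities are all assumptions introduced here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch: matched image-text pairs are positives,
    all other in-batch pairs are negatives (the CLIP objective)."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) cosine-similarity logits
    labels = np.arange(len(img))              # matched pairs lie on the diagonal
    loss_i2t = -np.log(softmax(logits, axis=1)[labels, labels]).mean()
    loss_t2i = -np.log(softmax(logits.T, axis=1)[labels, labels]).mean()
    return (loss_i2t + loss_t2i) / 2

def nclip_loss(img_probs, txt_probs, eps=1e-8):
    """Non-contrastive objective sketch: symmetric cross-entropy between the
    two modalities' assignment distributions; no negative pairs involved."""
    ce_i2t = -(txt_probs * np.log(img_probs + eps)).sum(axis=1).mean()
    ce_t2i = -(img_probs * np.log(txt_probs + eps)).sum(axis=1).mean()
    return (ce_i2t + ce_t2i) / 2

def xclip_loss(img, txt, img_probs, txt_probs, alpha=1.0):
    """Multi-task combination (hypothetical weighting `alpha`)."""
    return clip_loss(img, txt) + alpha * nclip_loss(img_probs, txt_probs)
```

In the actual framework the assignment distributions would come from learned projection heads over the same encoders, and additional regularization is typically needed to prevent the non-contrastive branch from collapsing to a trivial solution; the sketch above only shows how the two losses compose.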
