Paper Title

Text-Only Training for Image Captioning using Noise-Injected CLIP

Paper Authors

David Nukrai, Ron Mokady, Amir Globerson

Paper Abstract

We consider the task of image-captioning using only the CLIP model and additional text data at training time, and no additional captioned images. Our approach relies on the fact that CLIP is trained to make visual and textual embeddings similar. Therefore, we only need to learn how to translate CLIP textual embeddings back into text, and we can learn how to do this by learning a decoder for the frozen CLIP text encoder using only text. We argue that this intuition is "almost correct" because of a gap between the embedding spaces, and propose to rectify this via noise injection during training. We demonstrate the effectiveness of our approach by showing SOTA zero-shot image captioning across four benchmarks, including style transfer. Code, data, and models are available on GitHub.
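A minimal sketch of the noise-injection idea described in the abstract: captions are encoded with a frozen CLIP text encoder, Gaussian noise is added to the embedding to simulate the image-text embedding gap, and a decoder is trained to reconstruct the caption. The CLIP checkpoint, the toy GRU decoder, and the noise scale below are illustrative assumptions, not the authors' exact setup.

```python
# Sketch of text-only captioning training with noise-injected CLIP embeddings.
# Assumptions (not the authors' exact configuration): openai/clip-vit-base-patch32,
# a toy GRU decoder, and a hand-picked Gaussian noise standard deviation.
import torch
import torch.nn as nn
from transformers import CLIPTextModelWithProjection, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained(
    "openai/clip-vit-base-patch32"
).to(device).eval()
for p in text_encoder.parameters():
    p.requires_grad = False  # CLIP stays frozen; only the decoder is trained


class CaptionDecoder(nn.Module):
    """Toy decoder: maps a CLIP embedding back to a caption token sequence."""

    def __init__(self, embed_dim=512, vocab_size=49408, hidden=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden)
        self.init_proj = nn.Linear(embed_dim, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, clip_embed, token_ids):
        h0 = self.init_proj(clip_embed).unsqueeze(0)  # (1, B, H) initial state
        x = self.token_embed(token_ids[:, :-1])       # teacher forcing
        y, _ = self.gru(x, h0)
        return self.out(y)                            # (B, T-1, vocab) logits


decoder = CaptionDecoder().to(device)
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
noise_std = 0.1  # assumed value; controls how far embeddings are perturbed


def train_step(captions):
    batch = tokenizer(captions, padding=True, truncation=True,
                      return_tensors="pt").to(device)
    with torch.no_grad():
        embed = text_encoder(**batch).text_embeds        # (B, 512)
        embed = embed / embed.norm(dim=-1, keepdim=True)
        # Noise injection: perturb the text embedding so the decoder later
        # tolerates CLIP *image* embeddings, which lie near but not on the
        # text embedding manifold.
        embed = embed + noise_std * torch.randn_like(embed)
        embed = embed / embed.norm(dim=-1, keepdim=True)
    logits = decoder(embed, batch.input_ids)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                   batch.input_ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time the same decoder would be fed a CLIP *image* embedding instead of a text embedding; the noise injected during training is what makes this swap workable despite the gap between the two embedding spaces.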
