Paper Title
CLIPPO: Image-and-Language Understanding from Pixels Only
Paper Authors
Paper Abstract
Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP-style models, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without modifications.
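As a rough illustration of the idea described in the abstract, the sketch below rasterizes a caption onto a blank canvas, passes both real images and the rendered text through one shared encoder, and pairs them with a CLIP-style symmetric contrastive loss. It is a minimal sketch under assumptions: `render_text_as_image`, `SharedPixelEncoder`, the PyTorch/PIL stack, and the tiny Transformer configuration are illustrative placeholders, not the paper's actual ViT, rendering pipeline, or training recipe.

```python
# Hedged sketch of the CLIPPO idea from the abstract: a single pixel-based
# encoder for both photos and text rendered as images, trained contrastively.
# All names, sizes, and rendering details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image, ImageDraw, ImageFont


def render_text_as_image(text: str, size: int = 224) -> torch.Tensor:
    """Rasterize a string onto a blank canvas so an image encoder can read it."""
    canvas = Image.new("RGB", (size, size), color="white")
    ImageDraw.Draw(canvas).text((4, 4), text, fill="black", font=ImageFont.load_default())
    pixels = torch.tensor(list(canvas.getdata()), dtype=torch.float32) / 255.0
    return pixels.view(size, size, 3).permute(2, 0, 1)  # (C, H, W)


class SharedPixelEncoder(nn.Module):
    """Stand-in for the single ViT tower: patchify, encode, pool, project.
    Position embeddings and other ViT details are omitted for brevity."""

    def __init__(self, patch=16, dim=256, depth=4, heads=4, embed=128):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.project = nn.Linear(dim, embed)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        pooled = self.encoder(tokens).mean(dim=1)                  # mean-pool patch tokens
        return F.normalize(self.project(pooled), dim=-1)           # unit-norm embeddings


def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (e.g. image/alt-text)."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    encoder = SharedPixelEncoder()
    images = torch.rand(4, 3, 224, 224)                       # toy batch of "photos"
    captions = [f"a toy caption number {i}" for i in range(4)]
    rendered = torch.stack([render_text_as_image(c) for c in captions])
    loss = contrastive_loss(encoder(images), encoder(rendered))  # one tower for both modalities
    print(float(loss))
```

The next-sentence contrastive objective mentioned in the abstract would presumably reuse the same loss, with two consecutive sentences rendered as images forming the positive pair instead of an image/alt-text pair.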