Paper Title
Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models
Paper Authors
Paper Abstract
When trained at a sufficient scale, self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks. In this paper, we investigate simple yet effective approaches for adapting pre-trained foundation models to the downstream task of interest, namely, open-vocabulary semantic segmentation. To this end, we make the following contributions: (i) we introduce Fusioner, with a lightweight, transformer-based fusion module, that pairs frozen visual representations with language concepts using only a handful of image segmentation data. As a consequence, the model gains the capability of zero-shot transfer to segment novel categories; (ii) without loss of generality, we experiment on a broad range of self-supervised models that have been pre-trained with different schemes, e.g. visual-only models (MoCo v3, DINO), language-only models (BERT), and visual-language models (CLIP), and show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data; (iii) we conduct thorough ablation studies to analyze the critical components of our proposed Fusioner; when evaluated on standard benchmarks, e.g. PASCAL-5i and COCO-20i, it surpasses existing state-of-the-art models by a large margin, despite only being trained on frozen visual and language features; (iv) to measure the model's robustness in learning visual-language correspondence, we further evaluate on a synthetic dataset, named Mosaic-4, where images are constructed by mosaicking samples from FSS-1000. Fusioner demonstrates superior performance over previous models.
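The core idea, a lightweight transformer fusing frozen visual and language features, can be sketched roughly as follows. This is a minimal PyTorch illustration, not the paper's implementation: the feature dimensions, fusion depth, and the dot-product readout over class tokens are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Hypothetical sketch of a cross-modal fusion module.

    Per-patch features from a frozen vision model and per-class-name
    features from a frozen language model are projected to a shared
    width, concatenated into one token sequence, and fused by a small
    transformer encoder. Per-patch class logits are then read out by
    comparing fused patch tokens against fused class tokens.
    """

    def __init__(self, vis_dim=768, lang_dim=512, dim=256, depth=2, heads=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, dim)    # project frozen visual features
        self.lang_proj = nn.Linear(lang_dim, dim)  # project frozen language features
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, vis_feats, lang_feats):
        # vis_feats:  (B, N_patches, vis_dim)  from a frozen vision backbone
        # lang_feats: (B, N_classes, lang_dim) from a frozen language model
        v = self.vis_proj(vis_feats)
        t = self.lang_proj(lang_feats)
        n = v.shape[1]
        fused = self.fusion(torch.cat([v, t], dim=1))  # joint attention
        v_fused, t_fused = fused[:, :n], fused[:, n:]
        # similarity between each patch and each class name -> logits;
        # at test time, new class names can be supplied without retraining
        return torch.einsum("bnd,bcd->bnc", v_fused, t_fused)
```

Because the class-name tokens are inputs rather than a fixed classifier head, segmenting a novel category at test time only requires embedding its name with the frozen language model.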