Paper Title

OneFormer: One Transformer to Rule Universal Image Segmentation

Paper Authors

Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi

Paper Abstract

Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation in the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, Cityscapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even more performance improvement. We believe OneFormer is a significant step towards making image segmentation more universal and accessible. To support further research, we open-source our code and models at https://github.com/SHI-Labs/OneFormer
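
The abstract mentions a query-text contrastive loss used during training to sharpen inter-task and inter-class distinctions. Below is a minimal sketch of such a loss in PyTorch, assuming N object-query embeddings paired one-to-one with text embeddings; the function name `query_text_contrastive_loss`, the `temperature` value, and the pairing scheme are illustrative assumptions rather than the authors' exact implementation (see the official repository for that).

```python
# Minimal sketch (not the authors' exact code) of a bidirectional InfoNCE-style
# contrastive loss between object-query embeddings and paired text embeddings.
import torch
import torch.nn.functional as F


def query_text_contrastive_loss(query_embed: torch.Tensor,
                                text_embed: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Contrastive loss over N paired (query, text) embeddings.

    query_embed: (N, D) object-query features from the transformer decoder.
    text_embed:  (N, D) features of the corresponding text descriptions.
    """
    q = F.normalize(query_embed, dim=-1)
    t = F.normalize(text_embed, dim=-1)
    logits = q @ t.t() / temperature                      # (N, N) similarities
    targets = torch.arange(q.size(0), device=q.device)    # matched pairs on diagonal
    # Symmetric cross-entropy: query-to-text and text-to-query directions.
    loss_q2t = F.cross_entropy(logits, targets)
    loss_t2q = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_q2t + loss_t2q)


if __name__ == "__main__":
    # Illustrative usage with random tensors (shapes only).
    queries = torch.randn(16, 256)   # 16 object queries, 256-dim
    texts = torch.randn(16, 256)     # 16 matched text embeddings
    print(query_text_contrastive_loss(queries, texts).item())
```

The symmetric formulation pulls each query toward its paired text description and pushes it away from the others, which is one common way to realize the kind of inter-class and inter-task separation the abstract describes.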
