Paper Title

MuMIC -- Multimodal Embedding for Multi-label Image Classification with Tempered Sigmoid

Authors

Fengjun Wang, Sarai Mizrachi, Moran Beladev, Guy Nadav, Gil Amsalem, Karen Lastmann Assaraf, Hadas Harush Boker

Abstract

Multi-label image classification is a foundational topic in various domains. Multimodal learning approaches have recently achieved outstanding results in image representation and single-label image classification. For instance, Contrastive Language-Image Pretraining (CLIP) demonstrates impressive image-text representation learning abilities and is robust to natural distribution shifts. This success inspires us to leverage multimodal learning for multi-label classification tasks and to benefit from contrastively learnt pretrained models. We propose the Multimodal Multi-label Image Classification (MuMIC) framework, which utilizes a hardness-aware tempered sigmoid based Binary Cross Entropy loss function, thus enabling optimization of multi-label objectives and transfer learning on CLIP. MuMIC is capable of providing high classification performance, handling real-world noisy data, supporting zero-shot predictions, and producing domain-specific image embeddings. In this study, a total of 120 image classes are defined, and more than 140K positive annotations are collected on approximately 60K Booking.com images. The final MuMIC model is deployed on the Booking.com Content Intelligence Platform, and it outperforms other state-of-the-art models with 85.6% GAP@10 and 83.8% GAP on all 120 classes, as well as a 90.1% macro mAP score across 32 majority classes. We summarize the modeling choices, which are extensively tested through ablation studies. To the best of our knowledge, we are the first to adapt contrastively learnt multimodal pretraining for real-world multi-label image classification problems, and the innovation can be transferred to other domains.
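The abstract's key technical ingredient is a "tempered sigmoid based Binary Cross Entropy" loss for multi-label targets. The exact formulation is not given here, so the following is only an illustrative sketch under an assumed reading: each class is treated as an independent binary problem, and each logit passes through a temperature-scaled sigmoid (a temperature above 1 flattens the curve, which can damp gradients from noisy annotations). The hardness-aware weighting mentioned in the abstract is omitted; the function names and the default temperature are hypothetical, not taken from the paper.

```python
import math

def tempered_sigmoid(logit, temperature=2.0):
    # Sigmoid with a temperature parameter (hypothetical form):
    # temperature > 1 softens the curve, reducing over-confident
    # predictions on noisy real-world labels.
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def tempered_bce_loss(logits, labels, temperature=2.0, eps=1e-7):
    # Multi-label BCE: one independent binary term per class,
    # averaged over all classes. `labels` holds 0/1 per class.
    total = 0.0
    for logit, y in zip(logits, labels):
        p = tempered_sigmoid(logit, temperature)
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)

# A confident, correct prediction yields a small loss; an uncertain
# one (logit 0 -> probability 0.5) yields log(2) ~ 0.693.
```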
