Paper title
Model-Free Prediction of Partially Observable Spatiotemporal Chaotic Systems
CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization
Authors
Abstract
Weakly supervised object localization (WSOL) is the challenging task of localizing objects using only category labels. However, classification and localization are at odds, because an accurate classification network tends to attend to the discriminative regions of an object rather than to the object as a whole. We argue that this bias is caused by the hand-crafted threshold selection in CAM-based methods. We therefore propose Clustering and Filter of Tokens (CaFT), built on a Vision Transformer (ViT) backbone, to address the problem in a different way. First, CaFT splits the image into patches, feeds the patch tokens to the ViT, and clusters the output tokens to generate an initial mask of the object. Second, CaFT treats the initial mask as a pseudo label to train a shallow convolutional head (Attention Filter, AtF) that follows the backbone and extracts the mask directly from the tokens. Third, CaFT splits the image into parts, outputs a mask for each part, and merges them into one refined mask. Finally, a new AtF is trained on the refined masks and used to predict the bounding box of the object. Experiments show that CaFT outperforms previous work, achieving 97.55% and 69.86% localization accuracy with the ground-truth class on CUB-200 and ImageNet-1K, respectively. CaFT offers a fresh way to think about the WSOL task.
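As a rough illustration of two steps named in the abstract (clustering output tokens into an initial mask, and reading a box off a mask), a minimal pure-Python sketch might look like the following. It is not the authors' implementation. The abstract does not specify the clustering algorithm, so the plain k-means with k=2, the rule that the top-left token belongs to the background, the 16-pixel patch size, and the function names `two_means`, `initial_mask`, and `mask_to_box` are all assumptions made for illustration. The AtF training and the split-and-merge refinement are not covered.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(tokens, iters=10):
    """Cluster token vectors into two groups with plain k-means (k=2).

    Assumption: the abstract only says tokens are clustered; k-means
    is a stand-in chosen for this sketch.
    """
    # Deterministic init: the first token and the token farthest from it.
    centers = [list(tokens[0]),
               list(max(tokens, key=lambda t: sq_dist(t, tokens[0])))]
    labels = [0] * len(tokens)
    for _ in range(iters):
        # Assign each token to its nearest center.
        labels = [0 if sq_dist(t, centers[0]) <= sq_dist(t, centers[1]) else 1
                  for t in tokens]
        # Recompute each center as the mean of its members.
        for k in (0, 1):
            members = [t for t, lab in zip(tokens, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def initial_mask(tokens, grid_h, grid_w):
    """Turn row-major token features into a grid_h x grid_w boolean mask.

    Assumption (hypothetical heuristic): the top-left token is background,
    so the other cluster is treated as the object.
    """
    labels = two_means(tokens)
    background = labels[0]
    return [[labels[r * grid_w + c] != background for c in range(grid_w)]
            for r in range(grid_h)]


def mask_to_box(mask, patch=16):
    """Tightest pixel box (x0, y0, x1, y1) around True cells, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None
    return (cols[0] * patch, rows[0] * patch,
            (cols[-1] + 1) * patch, (rows[-1] + 1) * patch)


# Toy 4x4 token grid: a 2x2 "object" in the middle with distinct features.
toy_tokens = [[1.0, 1.0] if 1 <= r <= 2 and 1 <= c <= 2 else [0.0, 0.0]
              for r in range(4) for c in range(4)]
toy_mask = initial_mask(toy_tokens, 4, 4)
print(mask_to_box(toy_mask))  # (16, 16, 48, 48)
```

In a real pipeline the token features would come from the ViT backbone rather than a toy grid, and the foreground cluster would need a more robust selection rule than the corner heuristic used here.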