Paper Title
An analysis of over-sampling labeled data in semi-supervised learning with FixMatch
Paper Authors
Paper Abstract
Most semi-supervised learning methods over-sample labeled data when constructing training mini-batches. This paper studies whether this common practice improves learning, and how. We compare it to an alternative setting where each mini-batch is uniformly sampled from all the training data, labeled or not, which greatly reduces direct supervision from true labels in typical low-label regimes. However, this simpler setting can also be seen as more general, and even necessary in multi-task problems where over-sampling labeled data would become intractable. Our experiments on semi-supervised CIFAR-10 image classification using FixMatch show a performance drop when using the uniform sampling approach, a drop which diminishes as the amount of labeled data or the training time increases. Further, we analyse the training dynamics to understand how over-sampling of labeled data compares to uniform sampling. Our main finding is that over-sampling is especially beneficial early in training, but becomes less important in the later stages, when more pseudo-labels are correct. Nevertheless, we also find that keeping some true labels remains important to avoid the accumulation of confirmation errors from incorrect pseudo-labels.
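For concreteness, the following is a minimal sketch (not from the paper) contrasting the two batching schemes the abstract describes. The dataset split (250 labels out of 50,000 CIFAR-10 examples), the labeled batch size B, and the unlabeled-to-labeled ratio MU follow common FixMatch settings, but all names and numbers here are illustrative assumptions.

```python
import random

# Illustrative low-label regime: 250 labeled, 49,750 unlabeled examples
# (sizes are assumptions, chosen to mirror a typical CIFAR-10 setup).
labeled_data = [("labeled", i) for i in range(250)]
unlabeled_data = [("unlabeled", i) for i in range(49_750)]

B = 64   # labeled examples per mini-batch
MU = 7   # FixMatch's unlabeled-to-labeled ratio

def oversampled_batch():
    """Standard FixMatch batching: every mini-batch contains B labeled and
    MU * B unlabeled examples, so the tiny labeled set is revisited far
    more often than its share of the training data would dictate."""
    labeled = random.choices(labeled_data, k=B)
    unlabeled = random.choices(unlabeled_data, k=MU * B)
    return labeled, unlabeled

def uniform_batch():
    """Alternative studied in the paper: sample the whole mini-batch
    uniformly from the union of labeled and unlabeled data. With 250
    labels out of 50,000 examples, only ~0.5% of each batch is labeled."""
    batch = random.choices(labeled_data + unlabeled_data, k=(1 + MU) * B)
    labeled = [ex for ex in batch if ex[0] == "labeled"]
    unlabeled = [ex for ex in batch if ex[0] == "unlabeled"]
    return labeled, unlabeled

l, u = oversampled_batch()
print(f"over-sampled: {len(l)} labeled / {len(u)} unlabeled")
l, u = uniform_batch()
print(f"uniform:      {len(l)} labeled / {len(u)} unlabeled (labeled count varies)")
```

Running this makes the contrast explicit: the over-sampled scheme always yields 64 labeled examples per batch, while the uniform scheme yields roughly 2 to 3 on average, which is the reduction in direct supervision the abstract refers to.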