Title


Mix-Teaching: A Simple, Unified and Effective Semi-Supervised Learning Framework for Monocular 3D Object Detection

Authors

Lei Yang, Xinyu Zhang, Li Wang, Minghan Zhu, Chuang Zhang, Jun Li

Abstract


Monocular 3D object detection is an essential perception task for autonomous driving. However, its heavy reliance on large-scale labeled data makes model optimization costly and time-consuming. To reduce this over-reliance on human annotations, we propose Mix-Teaching, an effective semi-supervised learning framework that exploits both labeled and unlabeled images during training. Mix-Teaching first generates pseudo-labels for unlabeled images by self-training. The student model is then trained on mixed images with much denser and more precise labeling, produced by merging instance-level image patches into empty backgrounds or labeled images. This is the first work to break the image-level limitation and place high-quality pseudo-labels from multiple frames into a single image for semi-supervised training. Moreover, because confidence scores are misaligned with localization quality, it is hard to separate high-quality pseudo-labels from noisy predictions using a confidence-based criterion alone. To that end, we further introduce an uncertainty-based filter to help select reliable pseudo boxes for the mixing operation above. To the best of our knowledge, this is the first unified SSL framework for monocular 3D object detection. Mix-Teaching consistently improves MonoFlex and GUPNet by significant margins under various labeling ratios on the KITTI dataset. For example, our method achieves around a +6.34% [email protected] improvement over the GUPNet baseline on the validation set when using only 10% of the labeled data. Moreover, by leveraging the full training set and an additional 48K raw images from KITTI, it further improves MonoFlex by +4.65% [email protected] for car detection, reaching 18.54% [email protected] and ranking 1st among all monocular-based methods on the KITTI test leaderboard. The code and pretrained models will be released at https://github.com/yanglei18/Mix-Teaching.
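The two core operations the abstract describes, filtering pseudo-boxes by both confidence and uncertainty, and pasting instance-level patches into a background image, can be sketched as follows. This is a minimal illustration only, not the authors' implementation; the function names, thresholds, and the assumption that uncertainty is given as a scalar per box are all hypothetical.

```python
import numpy as np

def filter_pseudo_labels(scores, uncertainties, score_thr=0.7, unc_thr=0.2):
    """Keep pseudo-boxes that are both high-confidence and low-uncertainty.

    Hypothetical thresholds; the paper's filter combines a confidence-based
    criterion with an uncertainty-based one to reject noisy predictions.
    """
    scores = np.asarray(scores)
    uncertainties = np.asarray(uncertainties)
    return (scores >= score_thr) & (uncertainties <= unc_thr)

def mix_patches(background, patches, positions):
    """Paste instance-level image patches into a background image.

    `background` is an (H, W, C) array; each patch is an (h, w, C) array
    pasted at its (x, y) top-left position, mimicking the idea of
    collecting pseudo-labeled instances from multiple frames in one image.
    """
    mixed = background.copy()
    for patch, (x, y) in zip(patches, positions):
        h, w = patch.shape[:2]
        mixed[y:y + h, x:x + w] = patch
    return mixed
```

In a real pipeline the kept patches would carry their 3D pseudo-boxes along, and occlusion checks would prevent overlapping pastes; both are omitted here for brevity.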
