Identifying Mislabeled Images in Supervised Learning Utilizing Autoencoder

Paper Title

Identifying Mislabeled Images in Supervised Learning Utilizing Autoencoder

Authors

Yunhao Yang, Andrew Whinston

Abstract

Supervised learning assumes that the ground truth in the training data is accurate, but this cannot be guaranteed in real-world settings. Inaccurate training data leads to unexpected predictions; in image classification, incorrect labels make the classification model itself inaccurate. This paper applies unsupervised techniques to the training data before the classification network is trained. A convolutional autoencoder encodes and reconstructs the images, and its encoder projects the image data onto a latent space in which image features are preserved in a lower dimension. The assumption is that data samples with similar features are likely to share the same label. Noisy samples can then be identified in the latent space by the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm: incorrectly labeled data appear as outliers in the latent space, so the outliers that DBSCAN identifies are classified as incorrectly labeled samples. Once detected, all outliers are treated as mislabeled and removed from the dataset, and the remaining training data can be used directly to train the supervised learning network. The algorithm detects and removes more than 67% of the mislabeled data in the experimental dataset.
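The outlier-removal step described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the 2-D `latents` are synthetic stand-ins for encoder outputs (no autoencoder is trained here), DBSCAN is run separately on the samples sharing each label (the abstract does not say how labels enter the clustering), and the `eps` and `min_samples` values are arbitrary. Only the noise rule of DBSCAN is implemented, in plain Python: a point is noise if it is not a core point and has no core point within `eps`.

```python
import math


def dbscan_noise(points, eps, min_samples):
    """Return indices of the points that DBSCAN would label as noise."""
    n = len(points)
    # eps-neighborhood of every point (each point counts as its own neighbor)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core points have at least min_samples points in their neighborhood
    core = [len(nb) >= min_samples for nb in neighbors]
    # Noise: not a core point and not within eps of any core point
    return [
        i for i in range(n)
        if not core[i] and not any(core[j] for j in neighbors[i])
    ]


def flag_mislabeled(latents, labels, eps, min_samples):
    """Assumed procedure: cluster each label group on its own, flag its noise."""
    flagged = []
    for label in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == label]
        noise = dbscan_noise([latents[i] for i in idx], eps, min_samples)
        flagged.extend(idx[k] for k in noise)
    return sorted(flagged)


# Hypothetical latent vectors: two tight 5x5 grids, one per true class.
cluster_a = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
cluster_b = [(5 + 0.1 * i, 5 + 0.1 * j) for i in range(5) for j in range(5)]
latents = cluster_a + cluster_b
labels = [0] * 25 + [1] * 25

# Corrupt four labels so the samples sit in the "wrong" region for their label.
for i in (3, 17):
    labels[i] = 1
for i in (30, 44):
    labels[i] = 0

flagged = flag_mislabeled(latents, labels, eps=1.0, min_samples=5)

# Drop the flagged samples before supervised training.
clean = [
    (z, y) for i, (z, y) in enumerate(zip(latents, labels)) if i not in flagged
]
print(flagged, len(clean))
```

In this idealized toy setup every corrupted sample is isolated from the rest of its label group, so all four are flagged. Real latent spaces are far less cleanly separated, which is consistent with the abstract reporting detection of a little over 67% rather than all mislabeled data.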
