An Instance Selection Algorithm for Big Data in High imbalanced datasets based on LSH

Paper title

An Instance Selection Algorithm for Big Data in High imbalanced datasets based on LSH

Authors

Melo-Acosta, Germán E., Duitama-Muñoz, Freddy, Arias-Londoño, Julián D.

Abstract

Training Machine Learning (ML) models in real-world settings often involves big data sets with high class imbalance, where the class of interest (the minority class) is underrepresented. Practical solutions based on classical ML models address the problem of large data sets with parallel/distributed implementations of training algorithms, with approximate model-based solutions, or by applying instance selection (IS) algorithms to eliminate redundant information. However, the combined problem of big and highly imbalanced data sets has received less attention. This work proposes three new IS methods that can handle data sets that are both large and imbalanced. The proposed methods use Locality Sensitive Hashing (LSH) as the base clustering technique, and three different sampling methods are then applied to the clusters (or buckets) generated by LSH. The algorithms were developed in the Apache Spark framework, which guarantees their scalability. Experiments on three different data sets suggest that the proposed IS methods can improve the performance of a base ML model by between 5% and 19% in terms of the geometric mean.
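
The general idea described in the abstract (bucket the data with LSH, then sample within each bucket) can be illustrated with a minimal single-machine sketch. This is not the authors' implementation and not one of their three sampling methods: the random-hyperplane hash, the rule "keep every minority instance and a fixed fraction of majority instances per bucket", and all names (lsh_buckets, select_instances, keep_ratio, n_planes) are assumptions chosen for illustration. The geometric_mean helper assumes the metric is the square root of sensitivity times specificity, as is common in imbalanced classification; the abstract does not spell out the definition.

```python
import random
from collections import defaultdict


def lsh_buckets(points, n_planes=4, seed=0):
    """Group point indices by the sign pattern of random-hyperplane projections.

    Hypothetical helper: a simple sign-random-projection LSH, assumed here
    for illustration only.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        # One boolean per hyperplane: which side of the plane the point is on.
        key = tuple(
            sum(w * x for w, x in zip(plane, point)) >= 0.0 for plane in planes
        )
        buckets[key].append(idx)
    return buckets


def select_instances(points, labels, minority=1, keep_ratio=0.2, seed=0):
    """Keep all minority instances and a random fraction of the majority
    instances in each bucket (an assumed sampling rule, not the paper's)."""
    rng = random.Random(seed)
    selected = []
    for members in lsh_buckets(points, seed=seed).values():
        minority_idx = [i for i in members if labels[i] == minority]
        majority_idx = [i for i in members if labels[i] != minority]
        # Keep at least one majority instance per non-empty bucket.
        n_keep = max(1, round(keep_ratio * len(majority_idx))) if majority_idx else 0
        selected.extend(minority_idx)
        selected.extend(rng.sample(majority_idx, n_keep))
    return sorted(selected)


def geometric_mean(y_true, y_pred, positive=1):
    """sqrt(sensitivity * specificity) for binary labels (assumed definition)."""
    pos = sum(1 for t in y_true if t == positive)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        return 0.0
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return ((tp / pos) * (tn / neg)) ** 0.5
```

In a distributed setting the bucketing step would be computed per partition rather than in a Python loop; the sketch only shows the data flow from hashing to per-bucket sampling.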
