Paper Title

A Method for Handling Multi-class Imbalanced Data by Geometry based Information Sampling and Class Prioritized Synthetic Data Generation (GICaPS)

Authors

Anima Majumder, Samrat Dutta, Swagat Kumar, Laxmidhar Behera

Abstract

This paper looks into the problem of handling imbalanced data in a multi-label classification problem. The problem is solved by proposing two novel methods that primarily exploit the geometric relationship between the feature vectors. The first one is an undersampling algorithm that uses the angle between feature vectors to select more informative samples while rejecting the less informative ones. A suitable criterion is proposed to define the informativeness of a given sample. The second one is an oversampling algorithm that uses a generative procedure to create new synthetic data that respects all class boundaries. This is achieved by finding a \emph{no man's land} based on the Euclidean distance between the feature vectors. The efficacy of the proposed methods is analyzed by solving a generic multi-class recognition problem based on a mixture of Gaussians. The superiority of the proposed algorithms is established through comparison with other state-of-the-art methods, including SMOTE and ADASYN, over ten different publicly available datasets exhibiting high-to-extreme data imbalance. These two methods are combined into a single data processing framework, labeled ``GICaPS'', to highlight the role of Geometry-based Information (GI) sampling and Class-Prioritized Synthesis (CaPS) in dealing with the multi-class data imbalance problem, thereby making a novel contribution to this field.
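The abstract outlines two geometry-driven ideas: angle-based selection of informative samples and distance-aware synthetic generation placed in a \emph{no man's land} between classes. As a rough illustration only (not the authors' GICaPS implementation), the sketch below shows how such angle- and distance-based criteria could be wired together in Python; the function names, the `keep_ratio`/`n_new` parameters, and the specific informativeness score used here are all assumptions made for demonstration.

```python
import numpy as np

def angle_based_undersample(X, y, keep_ratio=0.5):
    """Illustrative sketch: for each class, keep the samples whose mean
    angle to their class-mates is largest, used here as a stand-in
    informativeness score (the paper defines its own criterion)."""
    kept = []
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        Xc = X[idx]
        unit = Xc / (np.linalg.norm(Xc, axis=1, keepdims=True) + 1e-12)
        cosine = np.clip(unit @ unit.T, -1.0, 1.0)
        score = np.arccos(cosine).mean(axis=1)        # mean pairwise angle
        n_keep = max(1, int(keep_ratio * len(idx)))
        kept.extend(idx[np.argsort(-score)[:n_keep]])
    kept = np.array(kept)
    return X[kept], y[kept]

def boundary_respecting_oversample(X, y, minority, n_new=100, seed=0):
    """Illustrative sketch: generate points on segments between minority
    samples and accept a candidate only if it stays closer to the
    minority class than to any other class -- a crude proxy for placing
    synthetic data in the "no man's land" without crossing class
    boundaries."""
    rng = np.random.default_rng(seed)
    Xm, Xo = X[y == minority], X[y != minority]
    new_pts, attempts = [], 0
    while len(new_pts) < n_new and attempts < 50 * n_new:
        attempts += 1
        a, b = Xm[rng.integers(len(Xm), size=2)]
        cand = a + rng.uniform() * (b - a)            # point on segment a-b
        if (np.linalg.norm(Xm - cand, axis=1).min()
                < np.linalg.norm(Xo - cand, axis=1).min()):
            new_pts.append(cand)
    if new_pts:
        X = np.vstack([X, np.asarray(new_pts)])
        y = np.concatenate([y, np.full(len(new_pts), minority)])
    return X, y
```

Under these assumptions, a caller would first undersample the over-represented classes and then oversample each minority class in turn, which mirrors the combined GI-sampling-then-CaPS ordering suggested by the framework's name.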
