
Two-tier Clustering for Mining Imbalanced Datasets (两层聚类的类别不平衡数据挖掘算法)

Authors: ; ; ;

Affiliation: Foshan University (佛山科学技术学院)

Source: Computer Science (《计算机科学》), 2013, No. 11, pp. 271-275 (5 pages)

Abstract: Classification of class-imbalanced data is an active research topic in machine learning and data mining. Traditional classification algorithms are strongly biased toward the majority class, so performance on minority class instances, which are usually of greater interest, is poor. This paper proposes a two-tier clustering cascade algorithm for mining class-imbalanced data. The algorithm first builds a balanced training set by cluster-based under-sampling: the majority class is clustered with K-means, cluster centroids are extracted in a number equal to the number of minority class samples, and these centroids are merged with all minority class instances. To avoid a training set that is too small when the minority class has very few samples, which would lower classification accuracy, SMOTE over-sampling is combined with the cluster-based under-sampling. Next, a cascade of K-means clustering and the C4.5 decision tree algorithm ("K-means+C4.5") is applied to the balanced training set: K-means partitions the training instances into K clusters, and C4.5 builds a decision tree within each cluster, so that the K per-cluster trees refine the classification decision boundaries by learning the subgroups within each cluster. Experimental results show that the proposed method gives better classification performance than other approaches on both the minority and majority classes, and that it is effective and feasible for imbalanced datasets.
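The first tier described in the abstract (cluster-based under-sampling combined with SMOTE over-sampling) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names (`kmeans_centroids`, `smote_like`, `balance`), the `target` class size, the neighbour count `k=3`, and the toy data are all assumptions, since the abstract does not give these details. The second tier (K-means followed by a C4.5 tree per cluster) is not shown.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans_centroids(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # requires k <= len(points)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids

def smote_like(minority, n_new, k=3, seed=0):
    """SMOTE-style over-sampling: interpolate between a minority point
    and one of its k nearest minority neighbours (needs >= 2 points)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        others = [q for q in minority if q is not p]
        neighbours = sorted(others, key=lambda q: dist2(p, q))[:k]
        q = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic

def balance(majority, minority, target, seed=0):
    """Return (X, y) with `target` samples per class (assumed design):
    majority -> `target` K-means centroids, minority -> padded by SMOTE."""
    n_extra = max(0, target - len(minority))
    minority_part = list(minority) + smote_like(minority, n_extra, seed=seed)
    majority_part = kmeans_centroids(majority, target, seed=seed)
    X = majority_part + minority_part
    y = [0] * len(majority_part) + [1] * len(minority_part)
    return X, y

# Toy data (hypothetical): 200 majority points, 10 minority points.
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]
X, y = balance(majority, minority, target=20)
print(len(X), y.count(0), y.count(1))
```

Replacing majority samples with centroids keeps the overall shape of the majority class while shrinking it, and the interpolation step only adds minority points that lie between existing ones. With `target` equal to the minority count, no synthetic points are generated and the sketch reduces to pure cluster-based under-sampling.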

Keywords: data mining; classification; imbalanced data; K-means clustering

Field: Automation and Computer Technology
