Computer Technology and Development
2015
Issue 9
pp. 17-21
5 pages
K-means; SMOTE algorithm; random forest; imbalanced data set
Random forest combined with the SMOTE algorithm handles the classification of imbalanced data sets well; it is a classifier that meets classification requirements by transforming the data. However, after SMOTE processes imbalanced data, it may change the overall distribution of the data set and blur the boundary between the positive and negative classes. These two defects can easily make the balanced data differ greatly from the original data set, so that classification results improve but remain unsatisfactory. The K-means algorithm clusters effectively and thereby describes the data distribution. On this basis, this paper combines K-means with SMOTE, drawing on the advantages of both, and proposes KM-SMOTE, a SMOTE algorithm based on K-means that effectively resolves the two problems above. The algorithm is then used with a random forest classifier in experiments, and the results show that the improved algorithm gives a more pronounced classification effect.
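The abstract does not list the steps of KM-SMOTE, so the following is only a minimal sketch of the general idea it describes: cluster the minority class with K-means first, then generate SMOTE-style synthetic samples inside each cluster so that new points follow the local distribution. The function names `nearest`, `kmeans`, and `cluster_oversample`, their parameters, the toy data, and the choice to interpolate between a sample and its cluster centroid are all assumptions made for illustration and are not taken from the paper. The random forest training step is left out.

```python
import random

def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, [nearest(p, centroids) for p in points]

def cluster_oversample(minority, n_new, k=2, seed=0):
    """Hypothetical helper: n_new synthetic minority points, each interpolated
    between a randomly chosen sample and the centroid of its own cluster."""
    rng = random.Random(seed)
    centroids, labels = kmeans(minority, k, seed=seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x, c = minority[i], centroids[labels[i]]
        gap = rng.random()  # SMOTE-style interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, c)))
    return synthetic

# Toy minority class made of two separated groups (illustrative data only).
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
new_points = cluster_oversample(minority, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point lies on a segment between a real minority sample and the centroid of that sample's cluster, it stays within the region its cluster already occupies. That is one plausible way to limit the distribution shift and boundary blurring the abstract attributes to plain SMOTE.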