Computer Engineering and Applications (计算机工程与应用)
2013
Issue 21
pp. 172-176, 185
6 pages
Wu Lei (吴磊)%Fang Bin (房斌)%Diao Liping (刁丽萍)%Chen Jing (陈静)%Xie Nana (谢娜娜)
imbalanced datasets%resampling%Cluster-Based Oversampling (CBOS)%Borderline Synthetic Minority Over-sampling Technique (BSM)%Edited Nearest Neighbor (ENN)%Tomek links%preprocessing
Classifier performance in machine learning depends on many factors, and class imbalance in the training data is among the most damaging. A training set is imbalanced when the examples of one class heavily outnumber those of the other. Of the many ways to handle imbalance, this paper considers only resampling applied as a preprocessing step to improve classification. Resampling methods fall into two groups: oversampling and undersampling. For binary classification, the paper proposes four methods that combine an oversampling step with an undersampling step: BSM+Tomek, BSM+ENN, CBOS+Tomek and CBOS+ENN. In extensive comparative experiments against ten other classical resampling methods, the four proposed methods improved classification on imbalanced data under several evaluation metrics, including on data sets with a small number of positive examples.
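The "oversample, then clean" pattern behind the four proposed methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. The oversampling step is a plain SMOTE-style interpolation standing in for BSM or CBOS, and the cleaning step removes both members of every Tomek link, which is an assumption about how the cleaning is applied. All function names and parameters (`smote_like`, `remove_tomek_links`, `oversample_then_clean`, `k`, `seed`) are hypothetical.

```python
import math
import random


def nearest(i, points):
    """Index of the nearest neighbour of points[i], excluding i itself.

    Needs at least two points; ties go to the lowest index.
    """
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    random minority point and one of its k nearest minority neighbours.

    Plain SMOTE-style interpolation, used here as a stand-in for BSM/CBOS.
    Needs at least two minority points when n_new > 0.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link.

    A Tomek link is a pair of points with different labels that are each
    other's nearest neighbour.
    """
    nn = [nearest(i, X) for i in range(len(X))]
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:
            drop.update((i, j))
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def oversample_then_clean(X, y, minority_label=1, seed=0):
    """Oversample the minority class up to the majority size, then remove
    Tomek links from the enlarged set."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)  # majority minus minority
    synthetic = smote_like(minority, n_new, seed=seed)
    X_bal = list(X) + synthetic
    y_bal = list(y) + [minority_label] * len(synthetic)
    return remove_tomek_links(X_bal, y_bal)
```

Calling `oversample_then_clean(X, y)` on a list of coordinate tuples `X` and a list of 0/1 labels `y` returns the resampled points and labels. An ENN-style cleaner, which removes points misclassified by their nearest neighbours, could replace `remove_tomek_links` to mirror the "+ENN" variants.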