计算机工程与应用
COMPUTER ENGINEERING AND APPLICATIONS
2014
Issue 11
pp. 120-125, 138
7 pages in total
Keywords: spectral clustering; imbalanced dataset; oversampling
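Classification of imbalanced data is one of the key challenges in data mining. Oversampling is an effective way to address imbalanced classification. Traditional oversampling methods do not consider within-class imbalance, so an oversampling method based on improved spectral clustering is proposed. The method first determines the number of clusters automatically and applies spectral clustering to the minority-class samples. It then determines how many samples to synthesize within each cluster from the ratio of the number of samples in that cluster to the total number of minority-class samples. Finally, it oversamples within each cluster to obtain a balanced new dataset. The effectiveness of the algorithm is verified on 4 real-world datasets. In addition, the results of k-means clustering and improved spectral clustering are compared on a two-dimensional synthetic dataset to explain why the oversampling algorithms based on the two clustering methods differ in performance.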
Classifying imbalanced datasets is one of the most crucial challenges in data mining. Oversampling is an effective way of dealing with imbalanced datasets. However, traditional oversampling methods pay no attention to within-class imbalance, which is pervasive in real-world datasets. To resolve this problem, this paper proposes an oversampling method based on modified spectral clustering. The method first automatically decides the best number of clusters and then applies modified spectral clustering to the minority samples. Based on the number of samples contained in each cluster, it determines how many samples to generate inside each cluster, so that the resulting dataset is balanced both between and within classes. The method is tested on 4 real-world datasets and one simulated dataset, and the results show that it is effective. Moreover, traditional k-means clustering based oversampling is compared with the proposed method, and the results are analyzed and explained.
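The per-cluster allocation and within-cluster oversampling steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. The function names `allocate_synthetic_counts` and `oversample_within_clusters` are hypothetical, and the cluster labels are assumed to come from a separate clustering step (such as spectral clustering) that is not shown. The abstract does not state the allocation formula or how synthetic samples are generated, so two assumptions are made here: clusters with a smaller share of the minority class receive more synthetic samples, and new samples are formed by SMOTE-style linear interpolation between two points of the same cluster.

```python
import random
from collections import defaultdict


def allocate_synthetic_counts(cluster_sizes, n_needed):
    """Split n_needed synthetic samples across minority-class clusters.

    ASSUMPTION: weights are inversely proportional to each cluster's share
    of the minority class, so sparse clusters receive more new samples.
    The abstract only says the count depends on this share.
    """
    total = sum(cluster_sizes.values())
    weights = {c: total / size for c, size in cluster_sizes.items()}
    weight_sum = sum(weights.values())
    exact = {c: n_needed * w / weight_sum for c, w in weights.items()}
    counts = {c: int(x) for c, x in exact.items()}
    # Largest-remainder rounding so the counts add up to n_needed exactly.
    leftover = n_needed - sum(counts.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    return counts


def oversample_within_clusters(samples, labels, n_needed, seed=0):
    """Generate n_needed synthetic minority samples, cluster by cluster.

    ASSUMPTION: each synthetic point is a random convex combination of two
    points drawn from the same cluster (SMOTE-style interpolation).
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for point, label in zip(samples, labels):
        clusters[label].append(point)
    sizes = {c: len(pts) for c, pts in clusters.items()}
    counts = allocate_synthetic_counts(sizes, n_needed)
    synthetic = []
    for c, pts in clusters.items():
        for _ in range(counts[c]):
            a = rng.choice(pts)
            b = rng.choice(pts)
            t = rng.random()
            # A point on the segment between a and b stays inside the cluster.
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic
```

Because interpolation never crosses cluster boundaries, synthetic points stay close to existing minority samples of the same cluster. With a single-point cluster the sketch simply duplicates that point, which a fuller implementation would probably handle differently.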