现代电子技术
MODERN ELECTRONICS TECHNIQUE
2014, No. 8, pp. 19-21 (3 pages)
k-means算法;信息熵;最优样本抽取;有效性指标
k-means algorithm; information entropy; optimal sample extraction; validity index
为了解决传统k-means算法需要输入k值和在超大规模数据集进行聚类的问题,这里在前人研究基础上,首先在计算距离时引入信息熵,在超大规模数据集采用数据抽样,抽取最优样本数个样本进行聚类,在抽样数据聚类的基础上进行有效性指标的验证,并且获得算法所需要的k值,然后利用引入信息熵的距离公式再在超大数据集上进行聚类。实验表明,该算法解决了传统k-means算法输入k值的缺陷,通过数据抽样在不影响数据聚类质量的前提下自动获取超大数据集聚类的k值。
The traditional k-means algorithm has two problems: the value of k must be supplied as input, and clustering ultra-large-scale data sets is difficult. Building on previous studies, this paper first introduces information entropy into the distance calculation. Data sampling is then applied to the ultra-large-scale data set: a sample of the optimal sample size is extracted and clustered. Validity indexes are verified on the clustering of the sampled data, which yields the k value required by the algorithm. The entropy-based distance formula is then used to cluster the full ultra-large data set. Experiments show that the algorithm removes the traditional k-means algorithm's need for k as input: through data sampling, it automatically obtains the k value for clustering the ultra-large data set without affecting clustering quality.
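The abstract does not give the entropy-based distance formula, the optimal sample size, or the validity index, so the following Python sketch only illustrates the overall pipeline under stated assumptions and is not the authors' implementation. It assumes the standard entropy weight method for per-feature weights, a weighted Euclidean distance, a caller-supplied `sample_size` standing in for the optimal sample size, and a simple compactness-to-separation ratio as a stand-in validity index. All function names are hypothetical, and farthest-first seeding is swapped in purely to keep the example deterministic.

```python
import math
import random


def entropy_weights(rows):
    """Per-feature weights via the entropy weight method (an assumed variant)."""
    n, dims = len(rows), len(rows[0])
    raw = []
    for j in range(dims):
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        if hi == lo:  # a constant feature carries no information
            raw.append(0.0)
            continue
        scaled = [(v - lo) / (hi - lo) for v in col]  # min-max scale to [0, 1]
        total = sum(scaled)
        probs = [v / total for v in scaled]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        raw.append(1.0 - entropy)  # lower entropy gives a larger weight
    s = sum(raw)
    return [w / s for w in raw] if s > 0 else [1.0 / dims] * dims


def dist(a, b, w):
    """Weighted Euclidean distance (assumed form of the entropy-based distance)."""
    return math.sqrt(sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b)))


def kmeans(rows, k, w, iters=100):
    """Lloyd's k-means with deterministic farthest-first seeding (an assumption)."""
    centers = [list(rows[0])]
    while len(centers) < k:
        far = max(rows, key=lambda p: min(dist(p, c, w) for c in centers))
        centers.append(list(far))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c], w)) for p in rows]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(rows, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centers[c])  # keep the old center for an empty cluster
        if updated == centers:
            break
        centers = updated
    return centers, labels


def validity_index(rows, centers, labels, w):
    """Stand-in index: mean within-cluster distance / minimum center separation."""
    within = sum(dist(p, centers[lab], w) for p, lab in zip(rows, labels)) / len(rows)
    sep = min(dist(a, b, w) for i, a in enumerate(centers) for b in centers[i + 1:])
    return within / sep if sep > 0 else math.inf


def choose_k_and_cluster(rows, sample_size, k_max, seed=0):
    """Pick k on a random sample, then cluster the full data set with that k."""
    sample = random.Random(seed).sample(rows, sample_size)
    w = entropy_weights(sample)  # weights estimated on the sample only (assumption)
    scores = {}
    for k in range(2, k_max + 1):
        centers, labels = kmeans(sample, k, w)
        scores[k] = validity_index(sample, centers, labels, w)
    best_k = min(scores, key=scores.get)  # lower index value is better
    _, labels = kmeans(rows, best_k, w)
    return best_k, labels


# Toy data: three well-separated blobs of 100 points each.
rng = random.Random(1)
blobs = [(0, 0), (10, 0), (0, 10)]
data = [[cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5)]
        for cx, cy in blobs for _ in range(100)]
k, labels = choose_k_and_cluster(data, sample_size=60, k_max=6)
print(k, len(set(labels)))
```

The candidate values of k are scored only on the sample, so the search over k costs a small fraction of what it would on the full data set; the full data set is clustered once, with the selected k.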