计算机工程与设计
COMPUTER ENGINEERING AND DESIGN
2010
Issue 8
1799-1801,1805
4 pages in total
苏晓珂%兰洋%程耀东%万仁霞
混合属性%增量聚类%差异度量%大规模数据集%约束
mixed attributes%incremental clustering%dissimilarity measure%large-scale dataset%constraint
为解决大规模数据集聚类过程中内存容量受限问题,提出了一种基于聚类个数约束的快速聚类算法,只需扫描一趟原始数据集,半径阈值随聚类过程动态变化;同时定义了一种包含分类属性取值频率信息的类间差异性度量,可用于混合属性数据集,时间复杂度与空间复杂度同数据集大小,属性个数近似成线性关系.在KDDCUP99数据集上的实验结果表明,提出的算法输入参数少,具有良好的聚类特性,可用于大规模数据集.
To address the memory-capacity limit encountered when clustering large-scale datasets, a fast clustering algorithm based on a constraint on the number of clusters is proposed. The original dataset is scanned only once, and the radius threshold changes dynamically as clustering proceeds. An inter-cluster dissimilarity measure that takes into account the frequency information of categorical attribute values is also defined, so the algorithm can be applied to datasets with mixed attributes. Time and space complexity are approximately linear in the dataset size and the number of attributes. Experimental results on the KDDCUP99 dataset show that the proposed algorithm is feasible and effective, requires few input parameters, and can be used for large-scale datasets.
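
The abstract gives no formulas, so the following is only a minimal Python sketch of the general idea (one scan, a cap on the number of clusters, a radius threshold that grows during clustering, and a dissimilarity that uses categorical value frequencies). Every name here (`Cluster`, `record_distance`, `cluster_distance`, `merge_close`, `one_pass_cluster`, `growth`), the specific distance formulas, and the merge rule are assumptions made for illustration, not the authors' definitions. Numeric attributes are assumed to be pre-scaled to [0, 1].

```python
from collections import Counter


class Cluster:
    """Summary of a cluster: size, numeric sums, categorical frequencies."""

    def __init__(self, num, cat):
        self.n = 0
        self.sums = [0.0] * len(num)
        self.freqs = [Counter() for _ in cat]
        self.add(num, cat)

    def add(self, num, cat):
        self.n += 1
        for i, x in enumerate(num):
            self.sums[i] += x
        for f, v in zip(self.freqs, cat):
            f[v] += 1

    def merge(self, other):
        self.n += other.n
        for i, s in enumerate(other.sums):
            self.sums[i] += s
        for f, g in zip(self.freqs, other.freqs):
            f.update(g)

    def centroid(self):
        return [s / self.n for s in self.sums]


def record_distance(c, num, cat):
    # Assumed measure: centroid gap for numeric attributes, and
    # 1 - relative frequency of the record's value for categorical ones.
    d = sum(abs(x - m) for x, m in zip(num, c.centroid()))
    d += sum(1.0 - f[v] / c.n for f, v in zip(c.freqs, cat))
    return d / (len(num) + len(cat))


def cluster_distance(a, b):
    # Assumed measure: centroid gap plus 1 - overlap of value distributions.
    d = sum(abs(x - y) for x, y in zip(a.centroid(), b.centroid()))
    for f, g in zip(a.freqs, b.freqs):
        d += 1.0 - sum(f[v] * g[v] for v in f) / (a.n * b.n)
    return d / (len(a.sums) + len(a.freqs))


def merge_close(clusters, radius):
    # Greedily fold each cluster into the first kept cluster within radius.
    merged = []
    for c in clusters:
        for m in merged:
            if cluster_distance(m, c) <= radius:
                m.merge(c)
                break
        else:
            merged.append(c)
    return merged


def one_pass_cluster(records, max_clusters, radius=0.1, growth=1.5):
    """Scan records once; enlarge radius and merge when the cap is exceeded."""
    clusters = []
    for num, cat in records:
        best = min(clusters, key=lambda c: record_distance(c, num, cat),
                   default=None)
        if best is not None and record_distance(best, num, cat) <= radius:
            best.add(num, cat)
        else:
            clusters.append(Cluster(num, cat))
        while len(clusters) > max_clusters:
            radius *= growth
            clusters = merge_close(clusters, radius)
    return clusters, radius
```

Because each cluster is stored only as a count, per-attribute sums, and per-attribute value counts, memory in this sketch depends on the number of clusters and attribute values rather than on the number of records, which is the property the abstract attributes to the cluster-count constraint.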