计算机工程
COMPUTER ENGINEERING
2013
Issue 12
pp. 247-250, 259
5 pages in total
data stream; clustering; temporal density; skew distribution; pruning; variable flow rate
TDCA, a clustering algorithm for data streams with skewed distributions, falls short in clustering speed and memory utilization, and a variable flow rate in the data stream seriously affects the quality of the clustering results. To address these problems, a data stream clustering algorithm named GR-Stream is proposed. It uses grid cells as the aggregation form for data points and an extended data structure based on the R-tree as the index that organizes the grid cells. On this basis, it introduces a pruning strategy and adjusts the way data points enter the tree. Tests on the real dataset KDD-CUP99 show that, compared with the TDCA algorithm, the proposed algorithm improves access speed during clustering by 40%, saves at least half of the memory usage by applying the pruning strategy, and keeps the average purity of the clustering results above 90% in a data stream environment with variable flow rate.
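The general idea of aggregating stream points into grid cells and pruning sparse cells can be illustrated with a minimal Python sketch. It is not the GR-Stream implementation: the class name, the cell width, the exponential decay used as a stand-in for temporal density, and the pruning threshold are all assumptions made for illustration, and a plain dictionary replaces the R-tree-based index described in the abstract.

```python
import math


class GridDensitySketch:
    """Toy grid-cell aggregation with time-decayed density and pruning.

    Assumptions (not taken from the paper): the cell width, the exponential
    decay rate, the pruning threshold, and the use of a plain dict in place
    of an R-tree-based index.
    """

    def __init__(self, cell_width=1.0, decay_rate=0.1, prune_below=0.5):
        self.cell_width = cell_width
        self.decay_rate = decay_rate
        self.prune_below = prune_below
        self.cells = {}  # cell key -> (density, time of last update)

    def cell_key(self, point):
        # Map a point to the integer coordinates of its enclosing grid cell.
        return tuple(math.floor(x / self.cell_width) for x in point)

    def _decayed(self, density, last_time, now):
        # Exponentially discount a density by the time elapsed since update.
        return density * math.exp(-self.decay_rate * (now - last_time))

    def insert(self, point, now):
        # Decay the cell's stored density to the current time, then add 1.
        key = self.cell_key(point)
        density, last_time = self.cells.get(key, (0.0, now))
        self.cells[key] = (self._decayed(density, last_time, now) + 1.0, now)

    def prune(self, now):
        # Drop cells whose decayed density fell below the threshold and
        # return how many cells were removed.
        stale = [key for key, (density, last_time) in self.cells.items()
                 if self._decayed(density, last_time, now) < self.prune_below]
        for key in stale:
            del self.cells[key]
        return len(stale)


grid = GridDensitySketch(cell_width=1.0, decay_rate=0.1, prune_below=0.5)
for t, point in enumerate([(0.2, 0.3), (0.7, 0.9), (5.1, 5.5)]):
    grid.insert(point, now=float(t))
removed = grid.prune(now=10.0)  # only the single-point cell (5, 5) is dropped
```

Because each cell stores only a density and a timestamp, memory grows with the number of occupied cells rather than the number of points, and pruning keeps that number bounded when parts of the space go quiet.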