北京科技大学学报
JOURNAL OF UNIVERSITY OF SCIENCE AND TECHNOLOGY BEIJING
2014
Issue 10
1411-1419
9 pages in total
武森%冯小东%杨杰%张晓楠
云计算%文本%聚类%相似度
cloud computing%documents%clustering%similarity
建立快速有效的针对大规模文本数据的聚类分析方法是当前数据挖掘研究和应用领域中的一个热点问题。为了同时保证聚类效果和提高聚类效率,提出基于“互为最小相似度文本对”搜索的文本聚类算法及分布式并行计算模型。首先利用向量空间模型提出一种文本相似度计算方法;其次,基于“互为最小相似度文本对”搜索选择二分簇中心,提出通过一次划分实现簇质心寻优的二分K-means聚类算法;最后,基于MapReduce框架设计面向云计算应用的大规模文本并行聚类模型。在Hadoop平台上运用真实文本数据的实验表明:提出的聚类算法与原始二分K-means相比,在获得相当聚类效果的同时,具有明显效率优势;并行聚类模型在不同数据规模和计算节点数目上具有良好的扩展性。
Developing fast and effective clustering methods for large-scale document data is a hot topic in current data mining research and applications. To preserve clustering quality while improving clustering efficiency, a document clustering algorithm based on searching for a "mutually minimum-similarity document pair" is proposed, together with a distributed parallel computing model. First, a document similarity measure is presented using the vector space model (VSM). Second, the two initial centers of each bisection are selected by searching for the mutually minimum-similarity document pair, which yields a bisecting K-means algorithm that optimizes the cluster centroids in a single partitioning step. Finally, a parallel clustering model for large-scale document data is designed for cloud computing applications on the MapReduce framework. Experiments on the Hadoop platform with real document datasets show that, compared with the original bisecting K-means, the proposed algorithm achieves comparable clustering quality with a clear efficiency advantage, and that the parallel clustering model scales well across different data sizes and numbers of computing nodes.
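The seeding idea in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not give the paper's similarity formula or its search procedure, so everything concrete here is an assumption: cosine similarity over raw term frequencies stands in for the VSM-based measure, a greedy walk (`mutual_min_pair`, a hypothetical helper) stands in for the pair search, and the largest cluster is the one split at each step. The MapReduce parallelization is not modeled.

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector (assumed stand-in for the paper's VSM weights)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def least_similar(i, vecs):
    """Index of the document least similar to document i (ties: lowest index)."""
    return min((j for j in range(len(vecs)) if j != i),
               key=lambda j: cosine(vecs[i], vecs[j]))

def mutual_min_pair(vecs, start=0):
    """Greedy walk until two documents are each other's least similar one."""
    a = start
    b = least_similar(a, vecs)
    while True:
        c = least_similar(b, vecs)
        # Stop once b has nothing strictly less similar than a; the pair
        # similarity strictly decreases otherwise, so the walk terminates.
        if cosine(vecs[b], vecs[c]) >= cosine(vecs[a], vecs[b]):
            return a, b
        a, b = b, c

def bisect(vecs):
    """Split a cluster in two with one assignment pass around the seed pair."""
    a, b = mutual_min_pair(vecs)
    left, right = [], []
    for i, v in enumerate(vecs):
        if i == a:
            left.append(i)        # each seed stays in its own half
        elif i == b:
            right.append(i)
        elif cosine(v, vecs[a]) >= cosine(v, vecs[b]):
            left.append(i)
        else:
            right.append(i)
    return left, right

def bisecting_clustering(texts, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    vecs = [tf_vector(t) for t in texts]
    clusters = [list(range(len(texts)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        if len(biggest) < 2:      # nothing left to split
            clusters.append(biggest)
            break
        left, right = bisect([vecs[i] for i in biggest])
        clusters.append([biggest[i] for i in left])
        clusters.append([biggest[i] for i in right])
    return clusters

docs = ["apple banana apple", "banana apple fruit",
        "car engine wheel", "engine car road"]
result = bisecting_clustering(docs, 2)
```

Seeding each split with the two most dissimilar documents is what lets this sketch assign every document in one pass, in contrast to the repeated trial bisections with random seeds used by standard bisecting K-means; whether the paper's own procedure matches this in detail cannot be determined from the abstract.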