哈尔滨师范大学自然科学学报
NATURAL SCIENCES JOURNAL OF HARBIN NORMAL UNIVERSITY
2012, Issue 1, pp. 32-36 (5 pages)
Hadoop; Map; Reduce; Kmeans
To meet the need for parallel clustering algorithms, this paper improves the implementation of the Kmeans algorithm on the Hadoop platform by using the Canopy algorithm to preprocess the data. On a movie dataset with a defined data structure, we conduct a single-machine comparison experiment, a cluster speedup experiment, and a cluster scale-up (expansion rate) experiment. These respectively show the efficiency, good speedup, and scalability of the improved implementation, so that it can be applied effectively to practical mining of massive data.
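The abstract does not show how Canopy preprocessing feeds Kmeans, so the following is only a minimal single-machine sketch of the general idea, not the paper's Hadoop implementation. The function names, the thresholds `t1` and `t2` (with `t1 > t2`), the Euclidean distance, and the toy points are all assumptions made for illustration. Canopy centers are used here as the initial Kmeans centers, which also fixes the number of clusters.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2):
    """Group points into canopies (hypothetical helper).

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (removal from the candidate list); t1 must exceed t2.
    Returns a list of (center, members) pairs.
    """
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_candidates = []
        for p in remaining:
            d = distance(p, center)
            if d < t1:
                members.append(p)           # loosely close: joins this canopy
            if d >= t2:
                still_candidates.append(p)  # not tightly close: may seed another canopy
        remaining = still_candidates
        canopies.append((center, members))
    return canopies


def kmeans(points, centers, max_iter=100):
    """Plain Kmeans iterations starting from the given centers."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Toy data (invented for this sketch): two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
canopies = canopy(points, t1=3.0, t2=1.5)
seeds = [center for center, _ in canopies]
centers, clusters = kmeans(points, seeds)
```

In a MapReduce setting, the assignment step would typically run in the mappers and the center update in the reducers, but that mapping is an assumption here rather than something the abstract states.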