To address the problem that the Kmeans clustering algorithm based on the maximum-minimum principle must traverse the entire data set many times when running on the Hadoop platform, an improved algorithm for selecting the initial cluster centers, called M+Kmeans, is proposed. The algorithm needs to traverse the global data only once, which greatly reduces the time consumed by parallel execution. Results from multiple sets of experiments show that M+Kmeans is well suited to large-scale Hadoop clusters, and that its speedup and scaleup (expansion rate) are clearly better than those of the original algorithm.
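
For context, the multi-pass cost attributed above to the maximum-minimum principle can be illustrated with a minimal single-machine sketch of the standard max-min (farthest-first) seeding rule. This is not the M+Kmeans algorithm, whose details the abstract does not give. The function name `max_min_seeds` and the choice of the first point as the first center are assumptions made for illustration only.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def max_min_seeds(points: List[Point], k: int) -> List[Point]:
    """Choose k initial centers with the max-min distance rule.

    Hypothetical helper: each new center is the point whose distance to its
    nearest already-chosen center is largest.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # Assumption: take the first point as the first center.
    centers = [points[0]]
    # nearest[i] holds the distance from points[i] to its closest center so far.
    nearest = [math.inf] * len(points)

    for _ in range(k - 1):
        newest = centers[-1]
        # One full scan of the data per additional center: refresh each
        # point's nearest-center distance against the newest center.
        nearest = [min(d, math.dist(p, newest)) for p, d in zip(points, nearest)]
        # The next center is the point farthest from all chosen centers.
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])

    return centers
```

In this sketch, selecting k centers takes k - 1 full scans of the data. On a distributed platform each such scan would be a pass over the whole data set, which is the cost the abstract says M+Kmeans reduces to a single traversal.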