计算机技术与发展
COMPUTER TECHNOLOGY AND DEVELOPMENT
2014
Issue 1
pp. 22-25, 30 (5 pages)
吕婉琪%钟诚%唐印浒%陈志朕
数据挖掘%大数据集%并行算法%Hadoop
data mining%large dataset%parallel algorithm%Hadoop
基于Hadoop分布式计算平台,给出一种适用于大数据集的并行挖掘算法。该算法对非结构化的原始大数据集以及中间结果文件进行垂直划分以确保能够获得完整的频繁项集,将各个垂直分块数据分配给不同的Hadoop计算节点进行处理,以减少各个计算节点的存储数据,进而减少各个计算节点执行交集操作的次数,提高并行挖掘效率。实验结果表明,给出的并行挖掘算法解决了大数据集挖掘过程中产生的大量数据通信、中间数据以及执行大量交集操作的问题,算法高效、可扩展。
This paper proposes a parallel mining algorithm for large datasets, built on the Hadoop distributed computing platform. The algorithm vertically partitions both the unstructured original dataset and the intermediate result files, so that the complete set of frequent itemsets can still be obtained. Each vertical data block is assigned to a different Hadoop computing node, which reduces the data stored on each node and, in turn, the number of intersection operations each node performs, improving parallel mining efficiency. Experimental results show that the algorithm addresses the heavy data communication, the large volume of intermediate data, and the large number of intersection operations that arise when mining large datasets, and that it is efficient and scalable.
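The abstract does not give the algorithm's details, so the following is only an illustrative sketch of the general idea it describes: storing data in vertical form (item to transaction-id set), counting support by set intersection, and splitting the work so that each node handles its own block. It assumes an Eclat-style depth-first search, and a plain loop stands in for the Hadoop computing nodes. All function names (`to_vertical`, `mine_prefix`, `mine_parallel_sketch`) and the round-robin assignment are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def to_vertical(transactions):
    """Convert horizontal transactions into vertical form: item -> set of transaction ids."""
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)


def mine_prefix(prefix, prefix_tids, candidates, min_support, out):
    """Extend `prefix` depth-first; the support of each extension is the size of a tidset intersection."""
    for i, (item, tids) in enumerate(candidates):
        new_tids = prefix_tids & tids  # the intersection operation
        if len(new_tids) >= min_support:
            itemset = prefix + (item,)
            out[itemset] = len(new_tids)
            mine_prefix(itemset, new_tids, candidates[i + 1:], min_support, out)


def mine_parallel_sketch(transactions, min_support, num_nodes=2):
    """Mine frequent itemsets, assigning starting items to hypothetical nodes round-robin."""
    vertical = to_vertical(transactions)
    frequent = sorted(
        ((item, tids) for item, tids in vertical.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    results = {}
    for node in range(num_nodes):
        # Each pass of this loop stands in for one independent computing node.
        for i in range(node, len(frequent), num_nodes):
            item, tids = frequent[i]
            results[(item,)] = len(tids)
            mine_prefix((item,), tids, frequent[i + 1:], min_support, results)
    return results


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, support in sorted(mine_parallel_sketch(data, min_support=3).items()):
        print(itemset, support)
```

Because every itemset is reached only from its smallest item, the blocks do not overlap, and the merged result does not depend on how many nodes are used. In this sketch a node only needs the tidsets for its own starting items and the items that sort after them.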