计算机研究与发展
計算機研究與髮展
계산궤연구여발전
Journal of Computer Research and Development
2015年
8期
1768-1783
,共16页
严玉良%董一鸿%何贤芒%汪卫
嚴玉良%董一鴻%何賢芒%汪衛
엄옥량%동일홍%하현망%왕위
频繁子图%大规模单图%分布式挖掘%Spark%负载均衡
頻繁子圖%大規模單圖%分佈式挖掘%Spark%負載均衡
빈번자도%대규모단도%분포식알굴%Spark%부재균형
frequent subgraph%single large-scale graph%distribute mining%Spark%workload balance
随着社交网络用户数的快速增加,大规模单图上频繁子图挖掘的需求越来越强烈。单机算法对大规模图的运行效率较低,难以支撑支持度较低的频繁子图的挖掘;现有的分布式环境下单图的频繁子图挖掘算法不支持子图增长模式的挖掘,它们所使用的Hadoop框架也不适合运行迭代式算法。提出了一种基于Spark的大规模单图频繁子图挖掘算法FSMBUS ,通过次优树构建并行计算的候选子图,在给定最小支持度时挖掘出所有的频繁子图,并利用非频繁检测和搜索顺序选择实现优化,还设计了一种名为Sorted‐Greedy的轻量级数据划分方法。实验结果表明,FSMBUS的效率要比现有单图上最新的算法快一个数量级,并支持更低最小支持度阈值以及更大规模图数据的挖掘,同时 FSMBUS 比其Hadoop的移植版要快2~4倍。
隨著社交網絡用戶數的快速增加,大規模單圖上頻繁子圖挖掘的需求越來越彊烈。單機算法對大規模圖的運行效率較低,難以支撐支持度較低的頻繁子圖的挖掘;現有的分佈式環境下單圖的頻繁子圖挖掘算法不支持子圖增長模式的挖掘,它們所使用的Hadoop框架也不適閤運行迭代式算法。提齣瞭一種基于Spark的大規模單圖頻繁子圖挖掘算法FSMBUS ,通過次優樹構建併行計算的候選子圖,在給定最小支持度時挖掘齣所有的頻繁子圖,併利用非頻繁檢測和搜索順序選擇實現優化,還設計瞭一種名為Sorted‐Greedy的輕量級數據劃分方法。實驗結果錶明,FSMBUS的效率要比現有單圖上最新的算法快一箇數量級,併支持更低最小支持度閾值以及更大規模圖數據的挖掘,同時 FSMBUS 比其Hadoop的移植版要快2~4倍。
수착사교망락용호수적쾌속증가,대규모단도상빈번자도알굴적수구월래월강렬。단궤산법대대규모도적운행효솔교저,난이지탱지지도교저적빈번자도적알굴;현유적분포식배경하단도적빈번자도알굴산법불지지자도증장모식적알굴,타문소사용적Hadoop광가야불괄합운행질대식산법。제출료일충기우Spark적대규모단도빈번자도알굴산법FSMBUS ,통과차우수구건병행계산적후선자도,재급정최소지지도시알굴출소유적빈번자도,병이용비빈번검측화수색순서선택실현우화,환설계료일충명위Sorted‐Greedy적경량급수거화분방법。실험결과표명,FSMBUS적효솔요비현유단도상최신적산법쾌일개수량급,병지지경저최소지지도역치이급경대규모도수거적알굴,동시 FSMBUS 비기Hadoop적이식판요쾌2~4배。
Mining frequent subgraphs in a single large‐scale graph is of huge demand with the rapid growth of the social networking .However ,it is inefficient for the serial algorithms to mine frequent subgraphs in low support when mining for a single large‐scale graph . Meanwhile , few existing distributed algorithms can't support the growth pattern mining ,and the Hadoop framework they worked is not suitable for iterative running .In this paper ,a distributed algorithm named FSMBUS for mining frequent subgraph in a single large‐scale graph under Spark framework is proposed . It constructs the parallel computing candidate subgraphs by suboptimal CAM Tree ,which returns all the frequent subgraphs for given user‐defined minimum support .Additionally ,infrequent patterns'test and searching order chosen is introduced to optimize the algorithm .Sorted‐Greedy method is designed for data partition to balance the workload .Our experiments show that FSMBUS runs faster and more effective than the existing algorithms with real datasets ,and even can run with the lower support threshold and the larger graph datasets as well .At the same time ,FSMBUS runs 2~4 times faster on Spark framework than that on Hadoop framework .