计算机科学与探索
計算機科學與探索
계산궤과학여탐색
JOURNAL OF FRONTIERS OF COMPUTER SCIENCE & TECHNOLOGY
2014年
1期
40-50
,共11页
周爽%鲍玉斌%王志刚%冷芳玲%于戈%邓超%郭磊涛
週爽%鮑玉斌%王誌剛%冷芳玲%于戈%鄧超%郭磊濤
주상%포옥빈%왕지강%랭방령%우과%산초%곽뢰도
BSP模型%图划分%分布式系统%负载均衡%虚拟桶
BSP模型%圖劃分%分佈式繫統%負載均衡%虛擬桶
BSP모형%도화분%분포식계통%부재균형%허의통
bulk synchronous parallel (BSP)%graph partition%distributed system%load balance%virtual bucket
图数据划分是基于BSP(bulk synchronous parallel)编程模型的大规模图处理系统中一个关键技术问题。传统的图划分技术需要多次迭代,时间复杂度过高,且划分结果不具有图顶点到分区的映射信息,因此这些算法并不适用于BSP模型下的数据划分。提出了一种新的面向BSP模型的负载均衡Hash数据划分算法(balanced Hash partition,BHP)。为了实现各个分区的出边数尽可能均衡,该算法引入了虚拟桶的概念,通过贪婪算法将虚拟桶重组为实际分区,保证了每个实际分区负载均衡,同时数据本地化策略使本分片上的数据尽可能地保留在本节点上,从而减小在数据加载时的数据迁移开销。从三个方面对比了BHP算法和经典Hash算法的性能,结果表明BHP算法能够提高作业的执行效率,减少消息发送的数量,有效解决了经典Hash算法的负载不均衡和分区间交互边过多的问题,当数据量变大时,效果尤为明显。
圖數據劃分是基于BSP(bulk synchronous parallel)編程模型的大規模圖處理繫統中一箇關鍵技術問題。傳統的圖劃分技術需要多次迭代,時間複雜度過高,且劃分結果不具有圖頂點到分區的映射信息,因此這些算法併不適用于BSP模型下的數據劃分。提齣瞭一種新的麵嚮BSP模型的負載均衡Hash數據劃分算法(balanced Hash partition,BHP)。為瞭實現各箇分區的齣邊數儘可能均衡,該算法引入瞭虛擬桶的概唸,通過貪婪算法將虛擬桶重組為實際分區,保證瞭每箇實際分區負載均衡,同時數據本地化策略使本分片上的數據儘可能地保留在本節點上,從而減小在數據加載時的數據遷移開銷。從三箇方麵對比瞭BHP算法和經典Hash算法的性能,結果錶明BHP算法能夠提高作業的執行效率,減少消息髮送的數量,有效解決瞭經典Hash算法的負載不均衡和分區間交互邊過多的問題,噹數據量變大時,效果尤為明顯。
도수거화분시기우BSP(bulk synchronous parallel)편정모형적대규모도처리계통중일개관건기술문제。전통적도화분기술수요다차질대,시간복잡도과고,차화분결과불구유도정점도분구적영사신식,인차저사산법병불괄용우BSP모형하적수거화분。제출료일충신적면향BSP모형적부재균형Hash수거화분산법(balanced Hash partition,BHP)。위료실현각개분구적출변수진가능균형,해산법인입료허의통적개념,통과탐람산법장허의통중조위실제분구,보증료매개실제분구부재균형,동시수거본지화책략사본분편상적수거진가능지보류재본절점상,종이감소재수거가재시적수거천이개소。종삼개방면대비료BHP산법화경전Hash산법적성능,결과표명BHP산법능구제고작업적집행효솔,감소소식발송적수량,유효해결료경전Hash산법적부재불균형화분구간교호변과다적문제,당수거량변대시,효과우위명현。
Graph data partition is a critical technical problem in the large-scale graph processing system based on bulk synchronous parallel (BSP) model. Traditional graph partition technology requires many iterations, the time com-plexity is too high, and the partition results don’t have the mapping information from vertexes to partitions. So those algorithms are not suitable for partitioning graph data for the BSP model based systems. This paper presents a new Hash partition algorithm that implements load balance, called balanced Hash partition (BHP). In order to achieve the balance of the number of out degrees in each partition as much as possible, the concept of virtual bucket is introduced. Then these virtual buckets are reorganized into desired partitions by using a greedy algorithm. In this way, the algorithm can ensure the load balance of each partition. At the same time, the data localization strategy makes the data on the split locate on the corresponding node as much as possible. So, the overhead of data migration during the data loading can be reduced. This paper does some experiments to compare the performance between BHP and the classic Hash algorithm from multiple aspects. The results show that the BHP algorithm can improve the job executing efficiency,reduce the number of sending messages, and effectively solve the problems of load imbalance and too many interac-tive edges among partitions.