CAJ | 학술논문

图数据划分是基于BSP（bulk synchronous parallel）编程模型的大规模图处理系统中一个关键技术问题。传统的图划分技术需要多次迭代，时间复杂度过高，且划分结果不具有图顶点到分区的映射信息，因此这些算法并不适用于BSP模型下的数据划分。提出了一种新的面向BSP模型的负载均衡Hash数据划分算法（balanced Hash partition，BHP）。为了实现各个分区的出边数尽可能均衡，该算法引入了虚拟桶的概念，通过贪婪算法将虚拟桶重组为实际分区，保证了每个实际分区负载均衡，同时数据本地化策略使本分片上的数据尽可能地保留在本节点上，从而减小在数据加载时的数据迁移开销。从三个方面对比了BHP算法和经典Hash算法的性能，结果表明BHP算法能够提高作业的执行效率，减少消息发送的数量，有效解决了经典Hash算法的负载不均衡和分区间交互边过多的问题，当数据量变大时，效果尤为明显。
도수거화분시기우BSP（bulk synchronous parallel）편정모형적대규모도처리계통중일개관건기술문제。전통적도화분기술수요다차질대，시간복잡도과고，차화분결과불구유도정점도분구적영사신식，인차저사산법병불괄용우BSP모형하적수거화분。제출료일충신적면향BSP모형적부재균형Hash수거화분산법（balanced Hash partition，BHP）。위료실현각개분구적출변수진가능균형，해산법인입료허의통적개념，통과탐람산법장허의통중조위실제분구，보증료매개실제분구부재균형，동시수거본지화책략사본분편상적수거진가능지보류재본절점상，종이감소재수거가재시적수거천이개소。종삼개방면대비료BHP산법화경전Hash산법적성능，결과표명BHP산법능구제고작업적집행효솔，감소소식발송적수량，유효해결료경전Hash산법적부재불균형화분구간교호변과다적문제，당수거량변대시，효과우위명현。
Graph data partition is a critical technical problem in the large-scale graph processing system based on bulk synchronous parallel (BSP) model. Traditional graph partition technology requires many iterations, the time com-plexity is too high, and the partition results don’t have the mapping information from vertexes to partitions. So those algorithms are not suitable for partitioning graph data for the BSP model based systems. This paper presents a new Hash partition algorithm that implements load balance, called balanced Hash partition (BHP). In order to achieve the balance of the number of out degrees in each partition as much as possible, the concept of virtual bucket is introduced. Then these virtual buckets are reorganized into desired partitions by using a greedy algorithm. In this way, the algorithm can ensure the load balance of each partition. At the same time, the data localization strategy makes the data on the split locate on the corresponding node as much as possible. So, the overhead of data migration during the data loading can be reduced. This paper does some experiments to compare the performance between BHP and the classic Hash algorithm from multiple aspects. The results show that the BHP algorithm can improve the job executing efficiency,reduce the number of sending messages, and effectively solve the problems of load imbalance and too many interac-tive edges among partitions.