计算机科学与探索
計算機科學與探索
계산궤과학여탐색
JOURNAL OF FRONTIERS OF COMPUTER SCIENCE & TECHNOLOGY
2014年
12期
1422-1431
,共10页
刘恒%寇月%申德荣%王泰明%于戈
劉恆%寇月%申德榮%王泰明%于戈
류항%구월%신덕영%왕태명%우과
分布式SimRank%随机游走路径%BSP模型%MapReduce
分佈式SimRank%隨機遊走路徑%BSP模型%MapReduce
분포식SimRank%수궤유주로경%BSP모형%MapReduce
distributed SimRank%random walk path%BSP model%MapReduce
SimRank算法是一种常用的相似性度量模型,它基于图的拓扑结构信息来衡量任意两个对象之间的相似程度。随着数据规模的不断增大,集中式SimRank算法已不适用,而已有的分布式SimRank算法在运行效率和扩展性等方面存在缺陷。针对上述问题,提出了一种两阶段的基于随机游走路径的分布式SimRank算法。第一阶段基于BSP(bulk synchronous parallel)模型建立随机游走路径索引信息,支持新路径的动态添加,并通过阈值过滤尽可能减少生成路径的数量;第二阶段利用第一阶段生成的索引信息,提出了基于MapReduce的分布式SimRank算法。最后,通过实验验证了算法的可行性和有效性。
SimRank算法是一種常用的相似性度量模型,它基于圖的拓撲結構信息來衡量任意兩箇對象之間的相似程度。隨著數據規模的不斷增大,集中式SimRank算法已不適用,而已有的分佈式SimRank算法在運行效率和擴展性等方麵存在缺陷。針對上述問題,提齣瞭一種兩階段的基于隨機遊走路徑的分佈式SimRank算法。第一階段基于BSP(bulk synchronous parallel)模型建立隨機遊走路徑索引信息,支持新路徑的動態添加,併通過閾值過濾儘可能減少生成路徑的數量;第二階段利用第一階段生成的索引信息,提齣瞭基于MapReduce的分佈式SimRank算法。最後,通過實驗驗證瞭算法的可行性和有效性。
SimRank산법시일충상용적상사성도량모형,타기우도적탁복결구신식래형량임의량개대상지간적상사정도。수착수거규모적불단증대,집중식SimRank산법이불괄용,이이유적분포식SimRank산법재운행효솔화확전성등방면존재결함。침대상술문제,제출료일충량계단적기우수궤유주로경적분포식SimRank산법。제일계단기우BSP(bulk synchronous parallel)모형건립수궤유주로경색인신식,지지신로경적동태첨가,병통과역치과려진가능감소생성로경적수량;제이계단이용제일계단생성적색인신식,제출료기우MapReduce적분포식SimRank산법。최후,통과실험험증료산법적가행성화유효성。
SimRank is a widely used model for computing similarity, it measures similarity between objects based on graph topology. With the rapid increase of data, the way of centralized SimRank is not applicable and current distributed SimRank approaches have some drawbacks in efficiency and scalability. This paper presents a two-stage distributed SimRank algorithm based on random walk path. The first stage is data preprocessing and a Find-K-Paths algorithm based on BSP (bulk synchronous parallel) model is proposed. The algorithm can effectively build the index information of random walk path and support the dynamic adding of new paths. The number of the generated paths can be reduced by threshold filtering. In the second stage, based on the index information, a distributed SimRank algorithm is proposed under MapReduce. The experiments demonstrate the feasibility and effectiveness of the proposed algorithm.