计算机工程与设计
COMPUTER ENGINEERING AND DESIGN
2015
Issue 6
Pages 1630-1636
7 pages in total
hashing algorithm%priority queue%balance%multi-topic%task scheduling
To improve resource utilization and Web crawling efficiency in a distributed environment, a priority-queue-based scheduling algorithm for distributed multi-topic crawlers, named PQ-MCSA, is proposed. A cache-based extendible hashing algorithm is used to partition the overall task set, and the resulting subtask sets are distributed evenly across the processing nodes by hashing on the URL logical second-level node. Within each processing node, the node's computing capacity is combined with a constructed task priority queue to schedule tasks belonging to different topics. The algorithm addresses two shortcomings of traditional distributed crawlers: insufficient scheduling of single-node processing resources and uneven crawling across multi-topic tasks. Results from an actual project show that the method effectively improves the balance of crawl results across topics and has strong practical value.
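The two stages described in the abstract can be illustrated with a minimal sketch. It is not the PQ-MCSA implementation: the abstract gives no details of the cache-based extendible hashing or of the priority function, so the sketch substitutes a plain modular hash and a simple per-topic rank. It also assumes that the "URL logical second-level node" can be approximated by the last two labels of the host name. All names (`second_level_key`, `assign_node`, `TopicBalancedQueue`) are hypothetical.

```python
import hashlib
import heapq
from collections import defaultdict
from urllib.parse import urlparse


def second_level_key(url: str) -> str:
    """Return the last two host labels, e.g. 'example.com'.

    Assumption: this stands in for the paper's "URL logical second-level
    node". It is naive and ignores multi-part suffixes such as 'co.uk'.
    """
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def assign_node(url: str, num_nodes: int) -> int:
    """Map a URL to a node index with a stable hash of its second-level key.

    Assumption: plain modular hashing replaces the paper's cache-based
    extendible hashing, which the abstract does not specify.
    """
    digest = hashlib.md5(second_level_key(url).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


class TopicBalancedQueue:
    """Per-node priority queue that interleaves tasks from different topics.

    Assumption: a task's priority is the number of tasks of the same topic
    enqueued before it, so the k-th task of every topic is served before
    the (k+1)-th task of any topic.
    """

    def __init__(self) -> None:
        self._heap = []                   # (rank, seq, topic, url)
        self._pushed = defaultdict(int)   # tasks enqueued so far, per topic
        self._seq = 0                     # tie-breaker, keeps FIFO order

    def push(self, topic: str, url: str) -> None:
        rank = self._pushed[topic]
        self._pushed[topic] += 1
        heapq.heappush(self._heap, (rank, self._seq, topic, url))
        self._seq += 1

    def pop(self):
        """Remove and return the (topic, url) pair with the lowest rank."""
        _, _, topic, url = heapq.heappop(self._heap)
        return topic, url

    def __len__(self) -> int:
        return len(self._heap)
```

In this sketch, URLs that share a second-level key always land on the same node, and a topic with many queued tasks cannot starve a topic with few, which is one simple way to approximate the balance across topics that the abstract reports.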