中文信息学报
JOURNAL OF CHINESE INFORMATION PROCESSING
2010, Issue 3, pp. 62-68 (7 pages)
计算机应用%中文信息处理%增量搜集%论坛爬虫%延迟
computer application%Chinese information processing%incremental crawl%forum crawler%delay
该文研究论坛的增量搜集问题.由于在论坛中同一主题通常分布在多个页面上,而传统增量搜集技术的抓取策略通常是基于单个页面,因此这些技术并不适于对论坛增量搜集.该文通过对许多论坛中版块变化规律的统计分析,提出了基于版块的论坛增量搜集策略.该策略将属于同一版块的所有页面看做一个整体,以它做为抓取的基本单位.同时该策略利用版块权重和局部时间规律确定抓取频率和抓取时间点.实验结果表明本策略对新增和新回复帖子的平均召回率为99.3%,并且与平均调度方法相比系统总延迟最高可减小42%.
This paper studies the incremental crawling of forums. Because a single topic in a forum usually spans several pages, while the revisit strategies of traditional incremental crawling techniques operate on individual pages, these techniques are poorly suited to crawling forum sites incrementally. Based on a statistical analysis of how boards change over time in many Web forums, a board-based incremental crawling strategy is proposed. The strategy treats all pages belonging to the same board as a whole and uses that whole as the basic unit for re-crawling. It uses board weights and local time patterns to determine crawl frequency and crawl times. Experimental results show that the average recall for newly published and newly replied threads is 99.3%, and that total system delay is reduced by up to 42% compared with an even scheduling method.
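As a rough illustration of the board-based idea, the sketch below splits a fixed crawl budget across boards in proportion to their weights and then places each board's crawls at its historically busiest hours. The function names, the proportional-allocation rule, and the hourly-activity input are assumptions made for illustration only; the abstract does not specify how board weights or local time patterns are computed.

```python
from typing import Dict, List


def allocate_crawls(weights: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` crawls across boards in proportion to weight.

    Hypothetical rule: every board gets at least one crawl, so after
    rounding the total may differ slightly from `budget`.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("board weights must sum to a positive value")
    return {board: max(1, round(budget * w / total)) for board, w in weights.items()}


def pick_crawl_hours(hourly_activity: List[float], n: int) -> List[int]:
    """Return the indices of the n most active hours, in chronological order."""
    ranked = sorted(
        range(len(hourly_activity)),
        key=lambda h: hourly_activity[h],
        reverse=True,
    )
    return sorted(ranked[:n])
```

With weights `{"news": 3.0, "chat": 1.0}` and a budget of 8, `allocate_crawls` assigns 6 crawls to `news` and 2 to `chat`; `pick_crawl_hours` would then schedule each board's crawls at its own peak hours rather than at evenly spaced intervals.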