计算机与现代化
Computer and Modernization
2015, Issue 9, pp. 77-80, 89 (5 pages in total)
focused crawler%OTIE algorithm%Shark-Search algorithm%tunneling
Focused crawling is a key technique in topic-specific search engines. To address the incomplete set of parameters considered by the On-line Topical Importance Estimation (OTIE) algorithm, this paper proposes an adaptive algorithm that combines link analysis with page content analysis: a page's link importance score and content relevance score are combined into a composite score that sets the download priority of unvisited URLs in the frontier. The algorithm also takes into account the tunneling problem that arises during topical crawling. Topics and seed pages are selected from the Open Directory Project (ODP), and the proposed algorithm is compared experimentally with three others: Best-First, Shark-Search, and OTIE. The results indicate that the proposed algorithm significantly outperforms the other three on average target recall while maintaining an acceptable harvest rate.
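The abstract gives no formulas, so the following is only a minimal sketch of the general idea it describes: a crawl frontier ordered by a composite of a link score and a content score, with a simple form of tunneling. The weighted sum, the weight `alpha`, the `relevance_threshold`, the `max_tunnel_depth` limit, and all class and function names are assumptions made for illustration. They are not the paper's actual scoring method.

```python
import heapq
import itertools


def combined_priority(link_score, content_score, alpha=0.5):
    """Hypothetical composite score: a plain weighted sum of the two inputs.

    The paper's real combination rule is not stated in the abstract.
    """
    return alpha * link_score + (1.0 - alpha) * content_score


class Frontier:
    """Priority queue of unvisited URLs with a simple tunneling limit.

    Assumption: "tunneling" is modelled as allowing links found on
    off-topic pages to be followed, but only for a bounded number of
    consecutive off-topic steps (max_tunnel_depth).
    """

    def __init__(self, max_tunnel_depth=2, relevance_threshold=0.3):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker for equal priorities
        self.max_tunnel_depth = max_tunnel_depth
        self.relevance_threshold = relevance_threshold

    def push(self, url, link_score, content_score, parent_depth=0):
        """Queue a URL; return False if it is a duplicate or tunnels too far.

        content_score is taken to be the relevance of the page on which
        the link was found; parent_depth is that page's off-topic depth.
        """
        if url in self._seen:
            return False
        # Reset the off-topic depth on a relevant page, otherwise extend it.
        if content_score >= self.relevance_threshold:
            depth = 0
        else:
            depth = parent_depth + 1
        if depth > self.max_tunnel_depth:
            return False
        self._seen.add(url)
        priority = combined_priority(link_score, content_score)
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, next(self._counter), url, depth))
        return True

    def pop(self):
        """Return (url, priority, off_topic_depth) for the best queued URL."""
        neg_priority, _, url, depth = heapq.heappop(self._heap)
        return url, -neg_priority, depth

    def __len__(self):
        return len(self._heap)
```

In this sketch a crawler would call `pop()` to pick the next page, score its outgoing links, and `push()` each one with the popped page's off-topic depth as `parent_depth`. Links reached through too many consecutive off-topic pages are dropped, which is one simple way to bound tunneling.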