计算机技术与发展
COMPUTER TECHNOLOGY AND DEVELOPMENT
2014
Issue 7
pp. 52-55, 59
(5 pages in total)
主题爬虫%地理信息更新%支持向量机%回溯算法
topic-driven web crawler%geographic information updating%support vector machine%backtracking algorithm
互联网的崛起为地理信息更新检索提供了一条新的途径,具有实时性强、成本低的优势。文中从实际出发,针对现有爬虫算法的缺陷,提出一种基于链接回溯的地理信息更新主题爬虫方法。首先,结合支持向量机分类技术,能够快速有效地找出一个网站中最有可能包含主题相关内容的链接方向;然后,回溯到这些链接后继续进行爬取,并通过地理信息变化要素知识库确定主题内容,从而优化爬取路径,减少低效率的爬取过程。实验结果表明,该方法可以找出最有可能包含地理信息的链接方向,大幅提高主题爬取效率,在其他主题方向也具有一定的可推广性。
The rise of the Internet offers a new way to retrieve geographic information updates, with the advantages of timeliness and low cost. To address the shortcomings of existing crawler algorithms, this paper proposes a topic-driven web crawler for geographic information updating based on link backtracking. First, support vector machine classification is used to quickly and effectively identify the link directions within a website that are most likely to lead to topic-relevant content. The crawler then backtracks to these links and resumes crawling, using a knowledge base of geographic information change elements to confirm the topic of the content; this optimizes the crawl path and reduces inefficient crawling. Experimental results show that the method can identify the link directions most likely to contain geographic information and substantially improves topic crawling efficiency, and that it can be extended to other topics to some degree.
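The two-phase idea in the abstract (score link directions, then backtrack to the promising ones and crawl them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the in-memory `SITE` graph, the hand-picked `WEIGHTS` and `BIAS` (standing in for a trained linear SVM's decision function), the `CHANGE_TERMS` set (standing in for the change-element knowledge base), and all function names are assumptions made only for illustration.

```python
from collections import deque

# Hypothetical toy site: url -> (page text, outgoing links). Not from the paper.
SITE = {
    "/": ("home", ["/news", "/shop"]),
    "/news": ("city news road", ["/news/a", "/news/b"]),
    "/news/a": ("new road opened bridge renamed", []),
    "/news/b": ("district boundary adjusted", ["/news/b/1"]),
    "/news/b/1": ("station relocated road closed", []),
    "/shop": ("buy discount sale", ["/shop/x"]),
    "/shop/x": ("discount sale price", []),
}

# Hand-picked weights standing in for a trained linear SVM (assumption).
WEIGHTS = {"road": 1.0, "bridge": 1.0, "boundary": 1.0, "station": 1.0,
           "sale": -1.0, "discount": -1.0, "price": -1.0}
BIAS = -0.5

# Hypothetical stand-in for the change-element knowledge base.
CHANGE_TERMS = {"opened", "renamed", "adjusted", "relocated", "closed"}


def decision(text):
    """Linear decision value w.x + b over word counts; > 0 means on-topic."""
    return sum(WEIGHTS.get(w, 0.0) for w in text.split()) + BIAS


def crawl(start, max_depth):
    """Breadth-first crawl from start, following at most max_depth link hops."""
    seen, queue, pages = {start}, deque([(start, 0)]), []
    while queue:
        url, depth = queue.popleft()
        text, links = SITE[url]
        pages.append((url, text))
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages


def rank_directions(root, probe_depth=1):
    """Phase 1: probe each link on the root page; score it by mean decision value."""
    scores = {}
    for link in SITE[root][1]:
        sample = crawl(link, probe_depth)
        scores[link] = sum(decision(t) for _, t in sample) / len(sample)
    return scores


def backtrack_and_crawl(root, max_depth=5):
    """Phase 2: return to positively scored links and crawl them in depth."""
    scores = rank_directions(root)
    hits = []
    for link in sorted(scores, key=scores.get, reverse=True):
        if scores[link] <= 0:
            continue  # skip directions the scorer considers off-topic
        for url, text in crawl(link, max_depth):
            # Confirm topic content against the (hypothetical) change terms.
            if CHANGE_TERMS & set(text.split()):
                hits.append(url)
    return scores, hits


scores, hits = backtrack_and_crawl("/")
```

In this toy run the shallow probe gives `/news` a positive score and `/shop` a negative one, so only the `/news` direction is crawled in depth. A real system would presumably replace `decision` with a classifier trained on labeled pages and `SITE` with fetched HTML.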