北京工业大学学报
JOURNAL OF BEIJING POLYTECHNIC UNIVERSITY
2015
Issue 7
pp. 1012-1019
(8 pages in total)
马志强%张泽广%闫瑞%杨双涛
focused crawler%topic group model%relevance calculation%tunnel%N-Gram model
Predicting which URLs to collect and discovering tunnels are the two core problems facing a focused crawler for Mongolian-language content. To address them, a collection model is proposed that performs site clustering, ordering, and tunnel discovery based on topic groups. First, through topic identification of sites, the URLs to be crawled are divided into site links and non-site links. Second, a priority-ordering algorithm for predicted URLs is built using text similarity and hyperlink graph analysis, and a site-adaptive tunnel discovery algorithm is designed at site granularity. Finally, a focused crawler system for Mongolian-language topics is constructed. Experimental results show that the algorithm improves collection precision, the total amount of information collected, and the collection rate, and clearly outperforms the baseline algorithm.
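The abstract names three ingredients: N-Gram text similarity, URL priority ordering, and tunnel discovery. The sketch below shows one way they might fit together in a crawl frontier. It is not the authors' algorithm: the class `Frontier`, the weights `alpha` and `threshold`, the budget `max_tunnel`, and the use of an inherited parent score as a crude stand-in for hyperlink graph analysis are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


def ngram_profile(text, n=2):
    """Count character n-grams of a string (whitespace removed, lowercased)."""
    s = "".join(text.split()).lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two Counter-based count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Frontier:
    """Hypothetical priority frontier with a tunnelling budget.

    Assumption: a link's priority mixes the N-gram similarity of its anchor
    text to the topic with the score of the page it was found on.
    """

    def __init__(self, topic_text, alpha=0.7, threshold=0.1, max_tunnel=2):
        self.topic = ngram_profile(topic_text)
        self.alpha = alpha            # weight of text similarity (assumed)
        self.threshold = threshold    # below this a link counts as off-topic
        self.max_tunnel = max_tunnel  # max consecutive off-topic hops allowed
        self.heap = []
        self.seen = set()

    def push(self, url, anchor_text, parent_score, parent_tunnel=0):
        """Queue a URL; return False if it is a duplicate or over budget."""
        if url in self.seen:
            return False
        sim = cosine(ngram_profile(anchor_text), self.topic)
        score = self.alpha * sim + (1 - self.alpha) * parent_score
        # An on-topic link resets the tunnel depth; an off-topic one extends it.
        tunnel = 0 if sim >= self.threshold else parent_tunnel + 1
        if tunnel > self.max_tunnel:
            return False
        self.seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self.heap, (-score, url, tunnel))
        return True

    def pop(self):
        """Return (url, score, tunnel_depth) for the highest-priority URL."""
        neg_score, url, tunnel = heapq.heappop(self.heap)
        return url, -neg_score, tunnel
```

In this sketch an off-topic link is still queued as long as the chain of off-topic hops leading to it stays within `max_tunnel`, which is the basic idea behind tunnelling: relevant pages are sometimes reachable only through irrelevant ones.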