电子科技大学学报
JOURNAL OF UNIVERSITY OF ELECTRONIC SCIENCE AND TECHNOLOGY OF CHINA
2013, Issue 1, pp. 115-120 (6 pages)
Li Huabo (李华波); Wu Lifa (吴礼发); Lai Haiguang (赖海光); Zheng Chenghui (郑成辉); Huang Kangyu (黄康宇)
Ajax; crawling algorithm; replicas-detecting policy; search engine
Generating and navigating Ajax pages requires executing client-side JavaScript, so traditional web crawling algorithms cannot retrieve the full content of an Ajax page. This paper analyzes how Ajax works, describes the main problems in crawling Ajax pages, and proposes and implements a crawling algorithm that handles them effectively. The algorithm drives a client browser to generate page content dynamically and to perform page navigation, assigns an identification number to each crawled page, and generates a corresponding static page. Experimental results show that the proposed algorithm crawls significantly more Ajax pages than traditional methods, and that its dual duplicate-elimination (replicas-detecting) policy effectively reduces the algorithm's running time.
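The crawl loop summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say what the two duplicate-elimination levels are, so the sketch assumes (1) an event-level check that skips any event signature already fired, and (2) a content-level check that discards a resulting page whose fingerprint has been seen. The names `TOY_APP`, `fingerprint`, and `crawl` are hypothetical, and a small dictionary stands in for the browser that would execute the JavaScript.

```python
import hashlib
from collections import deque

# Hypothetical stand-in for a browser-driven Ajax application:
# state name -> (rendered HTML, {event signature: resulting state}).
TOY_APP = {
    "home": ("<div>Home</div>", {"click#news": "news", "click#about": "about"}),
    "news": ("<div>News</div>", {"click#home": "home", "click#more": "news2"}),
    "news2": ("<div>More news</div>", {"click#back": "news", "click#home": "home"}),
    "about": ("<div>About</div>", {"click#home": "home", "click#news": "news"}),
}


def fingerprint(html):
    """Hash whitespace-normalized HTML so identical content maps to one key."""
    normalized = " ".join(html.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def crawl(app, start):
    """Breadth-first crawl; returns ({page id: static HTML}, events executed)."""
    start_html, _ = app[start]
    page_ids = {fingerprint(start_html): 0}  # content fingerprint -> page id
    static_pages = {0: start_html}           # page id -> static snapshot
    fired = set()                            # event signatures already executed
    executed = 0
    queue = deque([start])
    while queue:
        state = queue.popleft()
        _, events = app[state]
        for event, target in sorted(events.items()):
            # Assumed level 1: skip an event signature that was already fired.
            if event in fired:
                continue
            fired.add(event)
            executed += 1
            # Stands in for firing the event in a real browser.
            new_html, _ = app[target]
            # Assumed level 2: drop the result if its content was seen before.
            fp = fingerprint(new_html)
            if fp in page_ids:
                continue
            new_id = len(page_ids)
            page_ids[fp] = new_id
            static_pages[new_id] = new_html
            queue.append(target)
    return static_pages, executed


if __name__ == "__main__":
    pages, executed = crawl(TOY_APP, "home")
    for page_id, html in sorted(pages.items()):
        print(page_id, html)
    print("events executed:", executed)
```

In this toy model the event-level check avoids re-running navigation events that recur on several pages, which is one plausible way a deduplication policy could cut crawl time. Treating an event signature as globally unique is a simplification; in a real application the same event can lead to different content depending on the current state.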