计算机科学与探索
JOURNAL OF FRONTIERS OF COMPUTER SCIENCE & TECHNOLOGY
2014
Issue 10
pp. 1187-1194
8 pages in total
Yang Zhenxiong (杨镇雄); Cai Zurui (蔡祖锐); Chen Guohua (陈国华); Tang Yong (汤庸); Zhang Long (张龙)
distributed Web crawler; open access journal; plug-in mechanism
Open access (OA) journals are deep Web resources dispersed across the Internet. Traditional search engines have difficulty indexing them, so users cannot reach OA journal content directly through search engines, and these open resources go to waste. To collect the OA journal resources scattered throughout the Web in one place, this paper proposes a novel focused Web crawler with a distributed architecture. The architecture adopts a master-slave design in which one master control center manages multiple distributed crawler nodes, and it includes a method for extracting academic information from OA journal pages based on user-predefined rules. Crawler nodes can be added or removed dynamically, and a Chrome browser based plug-in mechanism gives the nodes scalability and deployment flexibility.
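The two ideas named in the abstract, rule-based extraction and a master that controls a changing set of crawler nodes, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the record gives no rule syntax, scheduling policy, or node protocol, so `ExtractionRule`, `extract_metadata`, `Master`, the regex-based rule format, and the least-loaded dispatch policy are all assumptions made for illustration. The Chrome plug-in side is not modeled.

```python
import re
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ExtractionRule:
    """Hypothetical user-predefined rule: a field name plus a regex with one capture group."""
    field_name: str
    pattern: str

    def apply(self, html):
        match = re.search(self.pattern, html, re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else None


def extract_metadata(html, rules):
    """Apply every rule to a page and keep only the fields that matched."""
    record = {}
    for rule in rules:
        value = rule.apply(html)
        if value is not None:
            record[rule.field_name] = value
    return record


@dataclass
class Master:
    """Hypothetical master control center: tracks nodes and hands out URLs."""
    pending: deque = field(default_factory=deque)
    nodes: dict = field(default_factory=dict)  # node id -> list of assigned URLs

    def register(self, node_id):
        # A crawler node joins; it starts with no assigned work.
        self.nodes.setdefault(node_id, [])

    def unregister(self, node_id):
        # A crawler node leaves; its assigned URLs go back to the queue.
        self.pending.extend(self.nodes.pop(node_id, []))

    def dispatch(self):
        # Assumed policy: give each pending URL to the node holding the fewest.
        while self.pending and self.nodes:
            target = min(self.nodes, key=lambda n: len(self.nodes[n]))
            self.nodes[target].append(self.pending.popleft())
```

In this sketch a site operator would supply one list of `ExtractionRule` objects per journal site, and the master would call `dispatch()` after every `register` or `unregister` so that work is redistributed when nodes come and go.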