电脑知识与技术
COMPUTER KNOWLEDGE AND TECHNOLOGY
2015, Issue 8, pp. 36-38 (3 pages)
Keywords: cloud computing; distributed web crawler; Hadoop
With the rapid development of the Internet industry and information technology, large companies such as Google, IBM, and Apache have invested heavily in cloud computing. Among these efforts, the Hadoop platform developed by Apache is a highly user-friendly open-source cloud computing framework. This paper designs and implements a distributed web crawler based on the Hadoop framework to collect data at large scale. By employing the Map/Reduce distributed computing framework together with the distributed file system, it addresses the low efficiency and poor scalability of single-machine crawlers, improving the speed of web page crawling and expanding the scale at which crawling can be performed.
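One Map/Reduce round of the kind of distributed crawler the abstract describes could be sketched roughly as follows. This is a minimal, hypothetical illustration in the Hadoop Streaming style, not the paper's actual implementation; the function names (`map_page`, `reduce_frontier`) and the regex-based link extraction are my own assumptions. The map step parses a fetched page and emits discovered links; the reduce step deduplicates them into the next crawl frontier.

```python
import re
from itertools import groupby

# Hypothetical sketch of one Map/Reduce crawl round (Hadoop Streaming
# style). All names here are illustrative, not taken from the paper.

HREF_RE = re.compile(r'href=["\'](http[^"\']+)["\']', re.IGNORECASE)

def map_page(url, html):
    """Map step: parse a fetched page and emit (discovered_url, source_url)
    pairs. In a real job, fetching the page would also happen here."""
    for link in HREF_RE.findall(html):
        yield link, url

def reduce_frontier(pairs):
    """Reduce step: group records by discovered URL and keep one entry per
    URL, deduplicating the next crawl frontier across all mappers."""
    pairs = sorted(pairs)  # Hadoop sorts by key between map and reduce
    for link, group in groupby(pairs, key=lambda kv: kv[0]):
        sources = [src for _, src in group]
        yield link, len(sources)  # URL plus how many pages linked to it

if __name__ == "__main__":
    html = ('<a href="http://a.example/x">x</a>'
            '<a href="http://a.example/x">x</a>')
    pairs = list(map_page("http://seed.example/", html))
    frontier = list(reduce_frontier(pairs))
    print(frontier)
```

Deduplicating in the reduce phase is what lets many crawler instances run independently: each mapper only sees its own pages, and the framework's shuffle-and-sort guarantees that all sightings of the same URL arrive at one reducer.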