计算技术与自动化
COMPUTING TECHNOLOGY AND AUTOMATION
2014
Issue 3
pp. 126-133
8 pages
杨济运%刘建勋%姜磊%彭桃%文一凭%卢厅
协程%分布式%高性能%爬虫
coroutine%distributed%high-performance%web crawler
网络爬虫主要受到网络延迟和本地运行效率的限制,传统的基于多线程的网络爬虫架构主要为了消除网络延迟而没有考虑到本地运行效率。在高并发的条件下,多线程架构爬虫由于上下文切换开销增大而导致本地运行效率降低,同时使得网络利用率下降,如何能够在最大化利用网络资源的情况下减小系统本地开销是一个需要研究的问题。针对以上问题,本文提出基于协程的分布式网络爬虫框架来解决,从开销、资源利用率、网络利用率上对协程框架和多线程框架进行了分析,并基于协程实现了一个分布式网络爬虫。实验表明该框架无论从开销、资源利用率和网络利用率上相对于多线程框架有比较明显的优势。
Web crawlers are limited mainly by network latency and local running efficiency. Traditional multi-threaded crawler architectures are designed chiefly to hide network latency and do not take local running efficiency into account. Under high concurrency, the growing cost of context switching lowers the local running efficiency of a multi-threaded crawler and, in turn, its network utilization. How to reduce local system overhead while making maximum use of network resources is therefore a problem worth studying. To address it, this paper proposes a distributed web crawler framework based on coroutines. We compare the coroutine framework and the multi-threaded framework in terms of overhead, resource utilization, and network utilization, and implement a distributed web crawler based on coroutines. Experiments show that the coroutine-based framework has a clear advantage over the multi-threaded framework in overhead, resource utilization, and network utilization.
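The core idea of the abstract, many concurrent downloads multiplexed onto one thread so that no thread context switches are needed, can be illustrated with a minimal sketch. The record does not say which language or coroutine library the authors used, so Python's asyncio is an assumption here, not the paper's implementation. The names fetch, crawl, run_demo and SIMULATED_LATENCY are hypothetical, the URLs are placeholders, and the network round trip is simulated with a sleep rather than a real request.

```python
import asyncio
import threading
import time

# Assumption: a fixed delay stands in for one network round trip.
SIMULATED_LATENCY = 0.05  # seconds


async def fetch(url):
    """Simulate downloading url; the await hands control to other coroutines."""
    await asyncio.sleep(SIMULATED_LATENCY)
    # Record which OS thread ran this coroutine.
    return url, threading.get_ident()


async def crawl(urls, max_concurrency):
    """Fetch every URL concurrently, with at most max_concurrency in flight."""
    limit = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with limit:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


def run_demo(n_urls=1000, max_concurrency=1000):
    """Run the simulated crawl; return results, elapsed time, thread ids."""
    urls = ["http://example.invalid/page/%d" % i for i in range(n_urls)]
    start = time.perf_counter()
    results = asyncio.run(crawl(urls, max_concurrency))
    elapsed = time.perf_counter() - start
    thread_ids = {tid for _, tid in results}
    return results, elapsed, thread_ids


if __name__ == "__main__":
    results, elapsed, thread_ids = run_demo()
    print("pages fetched:", len(results))
    print("OS threads used:", len(thread_ids))
    print("elapsed seconds: %.2f" % elapsed)
```

All simulated fetches run on a single OS thread, and the total time stays far below the sum of the individual latencies, because each coroutine yields at its await point instead of blocking a thread. The sketch covers only the single-node concurrency model; it does not attempt the distributed part of the framework or reproduce the paper's measurements.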