科技通报
BULLETIN OF SCIENCE AND TECHNOLOGY
2014, Issue 4, pp. 206-208 (3 pages)
网络爬虫算法%URL定位信息%BBS信息检索%数据挖掘
network crawler algorithm%URL location information%BBS information retrieval%data mining
利用Web页面的采集序位和被检索页面的相关信息和主题,使得以主题为分块的网络爬虫算法,能够尽可能多地把整个Web按照主题为依据进行分块整合,可以采用对URL定位信息,提高了页面的高效检索能力。仿真实验中表明,提出的主题相关爬虫算法能够跨越BBS中URL网页中的断裂带,提高了URL网页的召回率,也不至于因为网页的断裂而中止检索。算法精度分析表明,误判点都在等分线附近徘徊,偏差不大,表明算法精度较高。
The proposed topic-partitioned web crawler algorithm uses the collection order of Web pages, together with the related information and topics of the retrieved pages, to divide and integrate as much of the Web as possible by topic. URL location information is used to improve the efficiency of page retrieval. Simulation experiments on a real BBS show that the proposed topic-relevant crawler algorithm can cross the fracture zones between URL pages in the BBS, which improves the recall rate of URL pages, and that retrieval is not aborted because of a page fracture. Precision analysis shows that the misjudged points all lie close to the bisecting line with small deviations, indicating that the algorithm has high precision.