计算机应用研究
APPLICATION RESEARCH OF COMPUTERS
2015
Issue 9
pp. 2581-2586
6 pages in total
layout similarity%Web page content extraction%information retrieval
Effective Web content extraction removes redundant, duplicated, and useless information from massive volumes of Internet data, leaving data of greater practical meaning and value. Observation of Web pages shows that pages under the same Web site are very similar in content layout and style structure. Based on this observation, this paper proposes and implements a Web content extraction method based on layout similarity: it extracts the main content by comparing the similarity of node data in the DOM trees of Web pages that belong to the same topic of the same site. The paper also reports exploratory research on, and implementations of, related problems. Experiments show that the method is simple in concept, practical, and broadly applicable, and that it achieves relatively high accuracy while providing support for many Internet content analysis applications.
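The abstract does not give the details of the comparison, so the following is only a rough sketch of the general idea, not the paper's algorithm. It assumes two pages from the same site and topic, and it treats a text node as template material when the same text appears at the same tag path in both pages. The names `TextNodeCollector`, `text_nodes`, and `extract_content` are hypothetical, and exact matching of path and text stands in for whatever similarity measure the paper actually uses.

```python
from html.parser import HTMLParser

# Tags with no closing tag; they must not be pushed on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these tags is never page content.
SKIP_TAGS = {"script", "style"}


class TextNodeCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # collected (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not SKIP_TAGS.intersection(self.stack):
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    """Return the (tag path, text) pairs of one HTML document."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def extract_content(page, sibling):
    """Keep text of `page` that does not recur at the same path in `sibling`.

    Assumption: text shared by both pages is site template (navigation,
    footer), and the remaining text is the main content.
    """
    shared = set(text_nodes(sibling))
    return [text for path, text in text_nodes(page)
            if (path, text) not in shared]


if __name__ == "__main__":
    template = ("<html><body><div>Home | News</div>"
                "<div><p>{}</p></div><div>Copyright</div></body></html>")
    page_a = template.format("Story A text.")
    page_b = template.format("Story B text.")
    print(extract_content(page_a, page_b))
```

A practical version would likely compare a page against several sibling pages and use a tolerant similarity score rather than exact equality, since template text such as dates or counters can vary slightly between pages.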