计算机应用与软件
Computer Applications and Software
2015, Issue 9, pp. 17-21 (5 pages)
Web information extraction%Hidden Web%Web crawler
Because webpages contain large amounts of dynamic JavaScript, most page content is invisible to traditional web crawlers. To address this, we propose a hidden-Web information extraction algorithm based on DOM state transitions. The algorithm incrementally builds a DOM state transition machine, taking DOM nodes and their click events as the machine's input events. It recursively searches for transition paths that cause the target node to change, and it then replays the click path to capture the target node's content automatically. By overriding the prototype of the listener method, it collects all clickable nodes in the DOM tree as candidate nodes. The algorithm applies the RTDM algorithm and custom filters to compress the DOM state space and so shrink the search space, and it uses the distance from a candidate node to the target node in the DOM tree as the h score for heuristic search. Experiments show that the algorithm performs well, reaching 89.48% accuracy in extracting hidden-Web content, and that it can be applied to automated web testing, web crawling, and related areas.
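The heuristic search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that DOM nodes are identified by root-to-node child-index paths, that the h score is the number of tree edges between a candidate node and the target node, and that a priority queue (best-first search) stands in for the recursive search. The names `tree_distance`, `find_click_path`, and the toy `TRANSITIONS` table are hypothetical, and a plain visited set replaces RTDM-based state compression.

```python
import heapq
from itertools import count


def tree_distance(a, b):
    """Number of tree edges between two nodes given as root-to-node index paths."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def find_click_path(start, target, clickables, click, reveals_target, max_depth=5):
    """Best-first search for a click sequence that changes the target node.

    start          -- hashable id of the initial DOM state
    target         -- index path of the target node
    clickables     -- function: state -> list of clickable node paths
    click          -- function: (state, node) -> next state
    reveals_target -- function: state -> True if the target content is present
    """
    tie = count()  # unique tie-breaker so states and paths are never compared
    frontier = [(0, next(tie), start, [])]
    seen = {start}  # assumption: exact-match dedup instead of RTDM similarity
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if reveals_target(state):
            return path  # this click path can be replayed to fetch the content
        if len(path) >= max_depth:
            continue
        for node in clickables(state):
            nxt = click(state, node)
            if nxt in seen:
                continue
            seen.add(nxt)
            h = tree_distance(node, target)  # smaller distance is explored first
            heapq.heappush(frontier, (h, next(tie), nxt, path + [node]))
    return None


# Toy state machine (hypothetical): (state, clicked node) -> next state.
TRANSITIONS = {
    ("home", (0, 0)): "menu_open",
    ("home", (1,)): "footer_expanded",
    ("menu_open", (0, 1, 0)): "details_shown",
}


def clickables(state):
    return sorted(node for (s, node) in TRANSITIONS if s == state)


def click(state, node):
    return TRANSITIONS[(state, node)]


target = (0, 1, 1)
path = find_click_path("home", target, clickables, click,
                       lambda s: s == "details_shown")
print(path)  # [(0, 0), (0, 1, 0)]
```

In this toy run the click on node `(0, 0)` is expanded before `(1,)` because it lies closer to the target in the tree, which is the effect the h score is meant to have.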