CAJ | Academic paper

Web pages contain large amounts of dynamic JavaScript, which leaves most of their content invisible to traditional web crawlers. To address this, we propose a hidden-web information extraction algorithm based on DOM state transitions. The algorithm incrementally builds a DOM state transition machine whose input events are DOM nodes together with their click events. It recursively searches for transition paths that cause the target node to change, and it captures the target node's content automatically by replaying the click path. Candidate nodes are all clickable nodes in the DOM tree, which the algorithm collects by overriding the prototype of the event listener method. To shrink the search space, the algorithm compresses the DOM state space with the RTDM algorithm and custom filters, and it performs heuristic search using the distance in the DOM tree from a candidate node to the target node as the h score. Experiments show that the algorithm performs well, reaching an extraction accuracy of 89.48% on hidden web page content, and that it can be applied to areas such as automated web page testing and web crawling.
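The heuristic search step can be illustrated with a minimal sketch. It is not the paper's implementation: the state identifiers, the `transitions` table, the `reveals_target` predicate, and the encoding of DOM nodes as root-to-node child-index paths are all assumptions made for illustration, and the RTDM-based state compression is omitted. The sketch runs a greedy best-first search in which each click is scored by the tree distance from the clicked candidate node to the target node.

```python
import heapq
from itertools import count


def tree_distance(a, b):
    """Edges between two tree nodes given as root-to-node child-index paths."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def find_click_path(transitions, start, target_node, reveals_target):
    """Greedy best-first search for a click sequence that reaches a state
    in which the target node's content is available.

    transitions: hypothetical map from state -> [(clicked_node, next_state)].
    reveals_target: hypothetical predicate telling whether a state exposes
    the target content.
    """
    tie = count()  # tie-breaker so the heap never compares states or paths
    frontier = [(0, next(tie), start, [])]
    visited = {start}
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if reveals_target(state):
            return path
        for clicked, nxt in transitions.get(state, []):
            if nxt in visited:
                continue
            visited.add(nxt)
            # h score: distance from the clicked candidate to the target node
            h = tree_distance(clicked, target_node)
            heapq.heappush(frontier, (h, next(tie), nxt, path + [clicked]))
    return None


# Toy state machine: clicking node (1, 2) and then node (1, 2, 0)
# leads to state "s3", where the target content becomes available.
transitions = {
    "s0": [((0, 0), "s1"), ((1, 2), "s2")],
    "s2": [((1, 2, 0), "s3")],
}
target = (1, 2, 1)
path = find_click_path(transitions, "s0", target, lambda s: s == "s3")
print(path)
```

The returned list is the click path that would be replayed to capture the target node's content. Because candidates near the target are expanded first, the branch through `(0, 0)` is never explored in this toy example.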