CAJ | Academic paper

Wrapper-based information extraction methods can handle only one specific information source and depend heavily on the structure of the web page. To address this, a web page analysis method is proposed that treats Chinese punctuation marks and the HTML tree structure as the key features for identifying the main content of a web page. Part of the main content is first identified by counting Chinese punctuation marks; the remaining content is then identified from its structural similarity to the content already found. Experimental results show that the method effectively removes web page noise and extracts the main content, with good generality and high accuracy.
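The two-step idea in the abstract (seed the content by punctuation counts, then expand by structural similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the punctuation set `CN_PUNCT`, the threshold `min_punct`, and the use of an identical tag path as the measure of "structural similarity" are all assumptions made here for illustration, since the abstract does not specify them. The names `TextBlockCollector`, `extract_content`, and `SAMPLE_HTML` are likewise hypothetical.

```python
from html.parser import HTMLParser

# Assumed set of Chinese punctuation marks; the abstract does not list one.
CN_PUNCT = set(",。;:?!、“”‘’()《》")
SKIP_TAGS = {"script", "style"}          # text inside these is never content
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # no closing tag


class TextBlockCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.blocks = []   # (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (SKIP_TAGS & set(self.stack)):
            self.blocks.append(("/".join(self.stack), text))


def extract_content(html, min_punct=2):
    """Return text blocks judged to be main content (assumed heuristic)."""
    parser = TextBlockCollector()
    parser.feed(html)
    parser.close()
    # Step 1: blocks with enough Chinese punctuation become content seeds.
    seed_paths = {
        path for path, text in parser.blocks
        if sum(ch in CN_PUNCT for ch in text) >= min_punct
    }
    # Step 2: blocks sharing a seed's tag path count as structurally similar.
    return [text for path, text in parser.blocks if path in seed_paths]


# Hypothetical page: navigation and footer are noise, the <p> blocks are content.
SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">首页</a><a href="/news">新闻</a></div>
<div id="main">
<p>今天,研究人员提出了一种新的方法。该方法简单,而且有效。</p>
<p>短段落</p>
</div>
<div id="footer"><span>版权所有</span></div>
</body></html>
"""

if __name__ == "__main__":
    for block in extract_content(SAMPLE_HTML):
        print(block)
```

In this sketch the second paragraph contains no punctuation at all, yet it is still kept because it shares the tag path `html/body/div/p` with the seeded paragraph, while the navigation links and footer are dropped. A real system would likely need a finer similarity measure (for example, one that also compares attributes or sibling position), which is left open here.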