CAJ | Academic paper

Content on the Internet keeps growing, and search engines often return long lists of results. This paper first discusses four differences between the text of Web pages and ordinary text, and then introduces a method for automatically extracting the subject of the Chinese text in a Web page. The method is primarily statistical and is supplemented by matching-based verification, and it helps users grasp the subject of the current page in the shortest possible time. Experiments show that the first 15 extracted strings reflect the subject with an average accuracy above 85%, while processing takes only tens to hundreds of milliseconds.
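The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a frequency-based extraction of candidate strings from Chinese text could look like. Everything in it is an assumption made for illustration: the function names (`candidate_counts`, `top_strings`), the n-gram lengths, the frequency threshold, the scoring by frequency times length, and the substring filter that stands in for a verification step. It is not the authors' method.

```python
import re
from collections import Counter

# Runs of CJK Unified Ideographs; punctuation, Latin text and digits act as breaks.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def candidate_counts(text, min_len=2, max_len=4):
    """Count every character n-gram of length min_len..max_len inside each CJK run."""
    counts = Counter()
    for run in CJK_RUN.findall(text):
        for n in range(min_len, max_len + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return counts


def top_strings(text, k=15, min_freq=2):
    """Return up to k frequent strings, ranked by frequency * length (assumed score)."""
    counts = candidate_counts(text)
    frequent = {s: c for s, c in counts.items() if c >= min_freq}
    # Assumed check step: drop a string when a longer frequent string contains it
    # and occurs equally often, since the shorter one is then only a fragment.
    kept = [
        (s, c)
        for s, c in frequent.items()
        if not any(s != t and s in t and frequent[t] == c for t in frequent)
    ]
    # Sort by descending score, breaking ties by the string itself for determinism.
    kept.sort(key=lambda sc: (-sc[1] * len(sc[0]), sc[0]))
    return [s for s, _ in kept[:k]]


if __name__ == "__main__":
    sample = "搜索引擎返回结果。搜索引擎很快。主题提取，主题提取，主题提取。"
    print(top_strings(sample))
```

The default `k=15` mirrors the 15 strings evaluated in the abstract. A real system would also need to account for the differences between Web page text and ordinary text that the paper discusses, which this sketch ignores.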