CAJ | Academic paper

This paper presents a fast method for extracting and counting high-frequency character strings. By using hashing, the method needs neither a dictionary nor corpus training, and it performs no word segmentation; it extracts high-frequency strings from statistical information alone. After prefix and suffix processing based on linguistic knowledge, the resulting strings can serve as auxiliary information for unknown-word (out-of-vocabulary) handling, ambiguity resolution, and term weighting. Experiments show that the method is fast and is not constrained by the characteristics of the input text, and that it is highly usable on real texts such as novels.
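The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea it describes: count character n-grams in a hash table without any dictionary or word segmentation, then keep the frequent ones. The function names, the length bounds, the `min_count` threshold, and the substring filter are all assumptions made for illustration. In particular, the filter (dropping a string when a longer frequent string contains it with the same count) is a simple statistical stand-in and not the paper's linguistically informed prefix and suffix processing.

```python
import re
from collections import Counter


def count_ngrams(text, min_len=2, max_len=4):
    """Count character n-grams using a hash table (Counter is a dict subclass)."""
    counts = Counter()
    # Split on non-word characters so n-grams never span punctuation or spaces.
    # In Python 3, \W is Unicode-aware, so CJK characters count as word characters.
    for segment in re.split(r"\W+", text):
        for n in range(min_len, max_len + 1):
            for i in range(len(segment) - n + 1):
                counts[segment[i:i + n]] += 1
    return counts


def high_frequency_strings(text, min_count=2, min_len=2, max_len=4):
    """Return (string, count) pairs for frequent n-grams, most frequent first."""
    counts = count_ngrams(text, min_len, max_len)
    frequent = {s: c for s, c in counts.items() if c >= min_count}

    kept = {}
    for s, c in frequent.items():
        # Assumed filter: drop s if a longer frequent string contains it
        # and occurs equally often, since s then adds no new information.
        subsumed = any(
            len(t) > len(s) and s in t and frequent[t] == c for t in frequent
        )
        if not subsumed:
            kept[s] = c

    # Sort by descending count, then longer strings first, then lexicographically.
    return sorted(kept.items(), key=lambda kv: (-kv[1], -len(kv[0]), kv[0]))


if __name__ == "__main__":
    sample = "高频字串提取。高频字串统计。字串很多。"
    for string, count in high_frequency_strings(sample):
        print(string, count)
```

On the sample text, the shorter fragments of "高频字串" are filtered out because they never occur outside it, while "字串" survives because it also appears on its own. A quadratic filter like this is fine for a sketch; a real implementation for novel-length input would need a more efficient structure.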