计算机工程与设计
COMPUTER ENGINEERING AND DESIGN
2010
No. 3
pp. 630-633
4 pages
未登录词%中文分词%网络蜘蛛%论坛语料
unknown word%Chinese word segmentation%web spider%BBS corpus
为解决中文分词中未登录词识别效率低的问题,提出了基于论坛语料识别中文未登录词的新方法.利用网络蜘蛛下载论坛网页构建一个语料库,并对该语料库进行周期性的更新以获取具备较强时效性的语料;利用构造出的新统计量MD(由Mutual Information函数和Duplicated Combination Frequency函数构造)对语料库进行分词产生候选词表;最后通过对比候选词表与原始词表发现未登录词,并将识别出的未登录词扩充到词库中.实验结果表明,该方法可以有效提高未登录词的识别效率.
To address the low efficiency of unknown word recognition in Chinese word segmentation, a new method based on a BBS corpus is presented. A web spider downloads BBS web pages to build a corpus, which is updated periodically so that the corpus stays timely. A new statistic, MD (constructed from the mutual information function and the duplicated combination frequency function), is used to segment the corpus and generate a candidate word list. Finally, unknown words are identified by comparing the candidate word list with the original lexicon, and the identified unknown words are added to the lexicon. Experimental results show that the proposed method effectively improves the efficiency of unknown word recognition.
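The pipeline described in the abstract (score character combinations in the corpus, keep the strong ones as candidates, then compare against the lexicon) can be sketched roughly as follows. The abstract does not give the definition of MD, so this sketch does not implement it: as a stand-in it applies plain pointwise mutual information plus a raw repeat-count threshold to two-character strings. The function names `candidate_words` and `unknown_words` and the thresholds `min_pmi` and `min_count` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def candidate_words(sentences, min_pmi=1.0, min_count=2):
    """Return two-character strings that pass both thresholds.

    Stand-in for the paper's MD statistic (whose formula is not given in
    the abstract): pointwise mutual information plus a minimum count.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for s in sentences:
        char_counts.update(s)
        pair_counts.update(s[i:i + 2] for i in range(len(s) - 1))

    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())

    candidates = set()
    for pair, n in pair_counts.items():
        if n < min_count:
            continue  # combination does not repeat often enough
        p_xy = n / total_pairs
        p_x = char_counts[pair[0]] / total_chars
        p_y = char_counts[pair[1]] / total_chars
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            candidates.add(pair)
    return candidates


def unknown_words(candidates, lexicon):
    """Candidates that are absent from the existing lexicon."""
    return set(candidates) - set(lexicon)


# Toy corpus standing in for downloaded BBS posts (illustrative only).
corpus = ["我喜欢微博", "微博很火", "他也玩微博"]
found = unknown_words(candidate_words(corpus), lexicon={"喜欢"})
```

In this sketch the words in `found` would then be appended to the lexicon, mirroring the abstract's final step; a real system would also need to handle candidates longer than two characters.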