合肥工业大学学报(自然科学版)
JOURNAL OF HEFEI UNIVERSITY OF TECHNOLOGY(NATURAL SCIENCE)
2014
Issue 6
pp. 674-678, 724
6 pages in total
孙晓%李承程%叶嘉麒%任福继
natural language processing%Chinese word segmentation%repeated string%sub-word fragment
This paper builds on the statistical characteristics of repeated strings and analyzes the colloquial language found in microblog text to formulate corresponding linguistic rules, combining statistical and rule-based methods. First, the microblog corpus is segmented with the existing system dictionary. Then, new words that occur two or more times are extracted from the sub-word fragments left by segmentation. Finally, multi-layer filtering yields the final set of candidate new words. The experimental results show that the method achieves high precision and recall while maintaining the extraction speed for new words.
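The pipeline described in the abstract (segment, collect repeated strings from the fragments, then filter in layers) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not define its fragments or its filters, so the sketch assumes that a "sub-word fragment" is a maximal run of consecutive single-character tokens, and the three filter layers, the `STOP_CHARS` list, and the names `fragments` and `candidate_new_words` are hypothetical choices made for illustration only.

```python
from collections import Counter
from itertools import groupby

# Hypothetical list of function characters used by the rule-based filter;
# the paper's actual rules are not given in the abstract.
STOP_CHARS = set("的了是在我你他")


def fragments(tokens):
    """Yield maximal runs of consecutive single-character tokens as strings.

    Assumption: such runs approximate the 'sub-word fragments' that a
    dictionary-based segmenter leaves behind for out-of-vocabulary words.
    """
    for is_single, run in groupby(tokens, key=lambda t: len(t) == 1):
        if is_single:
            text = "".join(run)
            if len(text) >= 2:
                yield text


def candidate_new_words(segmented_posts, min_count=2, max_len=4):
    """Return {candidate: count} for repeated strings found in fragments."""
    counts = Counter()
    for tokens in segmented_posts:
        for frag in fragments(tokens):
            # Count every substring of length 2..max_len inside the fragment.
            for n in range(2, max_len + 1):
                for i in range(len(frag) - n + 1):
                    counts[frag[i:i + n]] += 1

    # Layer 1 (statistical): keep strings seen at least min_count times.
    repeated = {s: c for s, c in counts.items() if c >= min_count}

    # Layer 2 (rule-based, hypothetical): drop strings that begin or end
    # with a function character.
    repeated = {s: c for s, c in repeated.items()
                if s[0] not in STOP_CHARS and s[-1] not in STOP_CHARS}

    # Layer 3 (hypothetical): drop a string when a longer candidate with
    # the same count contains it, since the longer one subsumes it.
    return {s: c for s, c in repeated.items()
            if not any(s != t and s in t and c == repeated[t]
                       for t in repeated)}


# Toy input: posts already segmented by some dictionary-based segmenter.
posts = [
    ["今天", "给", "力", "的", "表现"],
    ["真", "是", "给", "力", "啊"],
    ["神", "马", "都", "是", "浮云"],
    ["神", "马", "情况"],
]
print(candidate_new_words(posts))
```

In this toy corpus only 给力 and 神马 occur at least twice inside fragments, so they are the only candidates that survive the frequency layer. A real system would replace the hypothetical filter layers with the statistical measures and microblog-specific rules developed in the paper.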