计算机技术与发展
COMPUTER TECHNOLOGY AND DEVELOPMENT
2014
Issue 1
pp. 98-101
4 pages in total
Keywords: new word identification; SVM; constraint conditions; kernel function
One of the key techniques in Chinese word segmentation is segmenting new words correctly, and this paper proposes a new method for identifying them. The method relies on the classification ability of the support vector machine (SVM). First, positive and negative samples are extracted from a training corpus that has been segmented and POS-tagged with the help of a segmentation dictionary. The samples are then vectorized using various features of the words themselves, computed from the training corpus, and the SVM is trained on these vectors to obtain the support vectors for new word classification. Next, a test corpus containing simulated new words is segmented and POS-tagged, and candidate new words are selected using the proposed constraint conditions and slack variables. Each candidate is vectorized together with its word features and passed to the trained SVM classifier, and the resulting value is compared with a threshold: a candidate whose value is less than the threshold is judged to be a new word, and one whose value is greater than the threshold is judged to be a non-new word. Finally, the most suitable SVM kernel function is selected by comparing experimental results.
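The classification step described in the abstract can be illustrated with a minimal sketch. It evaluates the standard kernel SVM decision function f(x) = sum_i (alpha_i * y_i) * K(sv_i, x) + b and applies the threshold rule stated above. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two-dimensional feature vectors, the support vectors, coefficients and biases, the threshold of 0, the labeling of new words as the negative class, and the function names.

```python
import math


def linear_kernel(u, v):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, dual_coefs, bias, kernel):
    """Kernel SVM decision function f(x) = sum_i c_i * K(sv_i, x) + b,
    where c_i stands for alpha_i * y_i."""
    weighted = sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs))
    return weighted + bias


def is_new_word(x, model, threshold=0.0):
    """Threshold rule from the abstract: a value below the threshold means
    'new word'. This assumes new words were labeled as the negative class."""
    return decision_value(x, **model) < threshold


# Hypothetical trained parameters: one support vector per class, with
# made-up two-dimensional word features (not the paper's feature set).
SUPPORT_VECTORS = [(0.9, 0.8), (0.2, 0.1)]
DUAL_COEFS = [-1.0, 1.0]  # first vector: new word (y = -1), second: known word (y = +1)

RBF_MODEL = {
    "support_vectors": SUPPORT_VECTORS,
    "dual_coefs": DUAL_COEFS,
    "bias": 0.0,
    "kernel": rbf_kernel,
}
LINEAR_MODEL = {
    "support_vectors": SUPPORT_VECTORS,
    "dual_coefs": DUAL_COEFS,
    "bias": 0.7,
    "kernel": linear_kernel,
}

# Two hypothetical candidate words, already vectorized.
candidate_a = (0.8, 0.9)
candidate_b = (0.1, 0.1)
rbf_decisions = [is_new_word(c, RBF_MODEL) for c in (candidate_a, candidate_b)]
linear_decisions = [is_new_word(c, LINEAR_MODEL) for c in (candidate_a, candidate_b)]
```

Swapping the `kernel` entry of the model is the only change needed to try a different kernel, which is the kind of comparison the abstract refers to. In practice the support vectors, coefficients and bias would come from training on the labeled corpus rather than being written by hand.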