计算机应用与软件
COMPUTER APPLICATIONS AND SOFTWARE
2014, Issue 2, pp. 174-176, 181 (4 pages in total)
王锦波 (Wang Jinbo)%王莲芝 (Wang Lianzhi)%高万林 (Gao Wanlin)%喻健 (Yu Jian)
Naive Bayes%Compound word recognition%Word feature items%Keyword extraction
To improve the accuracy of keyword extraction, we propose a naive Bayes keyword extraction algorithm based on improved statistical features of words. The algorithm builds on a compound word recognition step that uses the co-occurrence frequency of the words immediately before and after each occurrence of the same word in the text. It takes the length, part of speech, position and TF-IDF value of each word as feature items, and it improves the way word length, TF-IDF and word frequency are computed so that longer words and words with higher TF-IDF values receive higher probabilities. When counting word frequency, it also takes into account the containment relationship between words, that is, whether one word contains or is contained in another. A naive Bayes model is then trained on texts with labelled keywords to obtain the occurrence probability of each feature item, and these probabilities are used to extract keywords from new text. Experiments show that the keywords extracted by this method have higher accuracy and readability than those produced by traditional keyword extraction algorithms based on word frequency and on the C4.5 decision tree.
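
The two ideas in the abstract that lend themselves to code are the containment-aware frequency count and the naive Bayes classifier over word features. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the rule in `containment_aware_counts` (a word also receives the counts of longer words that contain it), the discretised feature values ("long"/"short", "high"/"low"), the Laplace smoothing and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def containment_aware_counts(counts):
    """Add to each word's count the counts of longer words that contain it.

    Assumption: this is one plausible reading of the containment
    relationship; the abstract does not state the exact rule.
    """
    adjusted = {}
    for word, count in counts.items():
        extra = sum(c for other, c in counts.items()
                    if other != word and word in other)
        adjusted[word] = count + extra
    return adjusted


class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (an assumption, not from the paper)

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        # (label, feature index) -> counts of each observed feature value
        self.feature_counts = defaultdict(Counter)
        # feature index -> set of values seen in training
        self.values = defaultdict(set)
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.feature_counts[(label, i)][value] += 1
                self.values[i].add(value)
        return self

    def log_posterior(self, row, label):
        """Unnormalised log P(label) + sum of log P(feature value | label)."""
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        for i, value in enumerate(row):
            num = self.feature_counts[(label, i)][value] + self.alpha
            den = self.class_counts[label] + self.alpha * len(self.values[i])
            score += math.log(num / den)
        return score

    def predict(self, row):
        return max(self.class_counts,
                   key=lambda label: self.log_posterior(row, label))


# Hypothetical training rows: (length bin, part of speech, position, TF-IDF bin)
train_rows = [
    ("long", "noun", "title", "high"),
    ("long", "noun", "body", "high"),
    ("short", "noun", "title", "high"),
    ("short", "verb", "body", "low"),
    ("short", "adj", "body", "low"),
    ("long", "verb", "body", "low"),
    ("short", "noun", "body", "low"),
]
train_labels = [True, True, True, False, False, False, False]  # is keyword?

model = CategoricalNaiveBayes().fit(train_rows, train_labels)
is_keyword = model.predict(("long", "noun", "title", "high"))
adjusted = containment_aware_counts({"Bayes": 2, "naive Bayes": 3, "model": 1})
```

In this sketch a candidate word is labelled a keyword when the keyword class has the larger log posterior. Ranking candidates by the difference between the two log posteriors would be one way to select a fixed number of keywords per document.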