计算机工程与应用
COMPUTER ENGINEERING AND APPLICATIONS
2013
Issue 10
pp. 140-146
7 pages
information entropy%term weighting%feature selection%text categorization
Text representation is an essential step when applying classification algorithms to text, and the choice of representation method strongly affects the final classification accuracy. To address the shortcomings of the classical term weighting method TFIDF (Term Frequency and Inverted Document Frequency), this paper proposes ETFIDF (Entropy based TFIDF), a term weighting algorithm based on information entropy theory. ETFIDF considers not only how often a term occurs in a document and how concentrated the term is in the training set, but also how dispersed the term is across the categories. Experimental results show that computing term weights with ETFIDF improves text categorization performance compared with TFIDF. The paper also presents a detailed theoretical analysis and experimental study of the relationship between ETFIDF and feature selection. The results show that taking the relationship between terms and categories into account at the text representation stage represents text more accurately, and that, when both accuracy and efficiency are considered, using ETFIDF together with a feature selection algorithm gives better classification results.
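The abstract names three ingredients (term frequency in a document, concentration of the term in the training set, and dispersion of the term across categories) but does not give the ETFIDF formula itself. The sketch below is only an illustration of how such factors could be combined; it is not the paper's definition. It assumes a standard log IDF and a hypothetical entropy factor `1 - H/H_max`, computed over the number of documents per category that contain the term. The names `entropy_factor` and `etfidf_sketch` are invented for this example.

```python
import math


def entropy_factor(class_doc_counts):
    """Hypothetical category-dispersion factor in [0, 1].

    class_doc_counts[i] is the number of training documents in category i
    that contain the term. The factor is 1 when the term occurs in a single
    category and 0 when it is spread evenly over all categories.
    """
    total = sum(class_doc_counts)
    k = len(class_doc_counts)
    if total == 0:
        return 0.0  # term never occurs: no evidence, so no weight
    if k < 2:
        return 1.0  # one category only: dispersion is not defined
    # Shannon entropy of the term's distribution over categories
    h = -sum((c / total) * math.log(c / total)
             for c in class_doc_counts if c > 0)
    return 1.0 - h / math.log(k)  # normalise by the maximum entropy log(k)


def etfidf_sketch(tf, df, n_docs, class_doc_counts):
    """Assumed combination: TF * IDF * entropy factor (illustrative only)."""
    if df <= 0 or n_docs <= 0:
        raise ValueError("df and n_docs must be positive")
    idf = math.log(n_docs / df)
    return tf * idf * entropy_factor(class_doc_counts)


if __name__ == "__main__":
    # Same TF and DF, different spread over two categories
    print(etfidf_sketch(3, 10, 100, [10, 0]))  # concentrated in one category
    print(etfidf_sketch(3, 10, 100, [5, 5]))   # evenly spread
```

Under this assumed form, two terms with identical TF and IDF receive different weights: the one concentrated in a single category keeps its full TFIDF weight, while the one spread evenly over categories is pushed toward zero, which is the kind of class-aware behaviour the abstract describes.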