安徽广播电视大学学报
Journal of Anhui Radio & TV University
2015, Issue 3, pp. 124-128 (5 pages)
text categorization; Chi-square; feature selection; feature word; KNN categorization
In text categorization, chi-square feature selection is an effective feature selection method. To calculate the chi-square value of a word, its chi-square value for each category is calculated first; the harmonic mean of these values, weighted by the category probabilities, then serves as the chi-square value of the word for the entire training set. This global approach ignores the correlation between words and categories. To address this problem, a category-based chi-square feature selection method is proposed, which chooses feature words for each category. The number of feature words per category is calculated from a pre-set threshold, the number of documents in the category, and the number of documents in the entire training set; the feature spaces of different categories may contain the same feature words. Using the K-Nearest Neighbor (KNN) classification method, the category-based method is compared with the global feature selection approach. Experimental results show that the category-based chi-square feature selection method can improve the overall performance of the classifier.
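
A minimal sketch, in Python, of the two scoring steps the abstract describes. The 2x2 contingency-table chi-square formula is standard; the class-prior-weighted harmonic mean is one plausible reading of the aggregation the abstract summarizes, not the paper's verified formula. All names (chi_square, global_chi_square, documents represented as token sets) are illustrative.

```python
def chi_square(A, B, C, D):
    """Chi-square statistic from a term/class 2x2 contingency table.

    A: docs in the class containing the term   B: docs outside it containing the term
    C: docs in the class without the term      D: docs outside it without the term
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - C * B) ** 2 / denom


def global_chi_square(term, docs, labels, classes):
    """Global score for a term: harmonic mean of its per-class chi-square
    values, weighted by the class priors P(c) (an assumed reading of the
    abstract's 'harmonic mean via category probability')."""
    n = len(docs)
    inv_sum = 0.0
    for c in classes:
        in_c = [d for d, y in zip(docs, labels) if y == c]
        out_c = [d for d, y in zip(docs, labels) if y != c]
        A = sum(term in d for d in in_c)
        C = len(in_c) - A
        B = sum(term in d for d in out_c)
        D = len(out_c) - B
        score = chi_square(A, B, C, D)
        p_c = len(in_c) / n
        # A zero per-class score drives the harmonic mean to zero.
        inv_sum += p_c / score if score > 0 else float("inf")
    return 0.0 if inv_sum == float("inf") else 1.0 / inv_sum
```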
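
The abstract names only the inputs to the per-class quota (a preset threshold, the class's document count, and the total document count), so the sketch below assumes the simplest combination: a class's quota is the overall budget scaled by its share of training documents. It reuses chi_square from the sketch above; category_based_selection and total_budget are hypothetical names.

```python
def category_based_selection(docs, labels, classes, vocabulary, total_budget):
    """Pick feature words per class by chi-square rank.

    Assumption: class c's quota is total_budget * N_c / N, i.e. the preset
    threshold scaled by the class's share of training documents. The final
    feature space is the union over classes, so different classes may
    contribute the same word, as the abstract notes.
    """
    n = len(docs)
    selected = set()
    for c in classes:
        in_c = [d for d, y in zip(docs, labels) if y == c]
        out_c = [d for d, y in zip(docs, labels) if y != c]
        quota = max(1, round(total_budget * len(in_c) / n))
        scored = []
        for term in vocabulary:
            A = sum(term in d for d in in_c)
            C = len(in_c) - A
            B = sum(term in d for d in out_c)
            D = len(out_c) - B
            scored.append((chi_square(A, B, C, D), term))
        scored.sort(reverse=True)  # highest chi-square first
        selected.update(term for _, term in scored[:quota])
    return selected
```

Under these assumptions, the comparison the abstract reports would vectorize documents over each resulting feature set (global vs. category-based) and train the same KNN classifier on both, attributing any difference in overall performance to the feature selection strategy.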