安徽广播电视大学学报
Journal of Anhui Radio & TV University
2015, Issue 3, pp. 124-128 (5 pages)
text categorization; Chi-square; feature selection; feature word; KNN categorization
In text categorization, chi-square feature selection is an effective feature selection method. To calculate the chi-square value of a word, its chi-square value for each category is calculated first; the harmonic mean of these values, weighted by the category probabilities, then serves as the chi-square value of the word for the entire training set. This global approach ignores the correlation between words and categories. To address this problem, a category-based chi-square feature selection method is proposed, which chooses feature words for each category. The number of feature words per category is calculated from a pre-set threshold, the number of documents in the category, and the number of documents in the entire training set; the feature spaces of different categories may contain the same feature words. Using the K-Nearest Neighbor (KNN) classification method, the category-based method is compared with the global feature selection approach. Experimental results show that the category-based chi-square feature selection method can improve the overall performance of the classifier.
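
A minimal sketch, in Python, of the two scoring steps the abstract describes. The 2x2 contingency-table chi-square formula is standard; the class-prior-weighted harmonic mean is one plausible reading of the aggregation the abstract summarizes, not the paper's verified formula. All names (chi_square, global_chi_square, documents represented as token sets) are illustrative.

```python
def chi_square(A, B, C, D):
    """Chi-square statistic from a term/class 2x2 contingency table.

    A: docs in the class containing the term   B: docs outside it containing the term
    C: docs in the class without the term      D: docs outside it without the term
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - C * B) ** 2 / denom


def global_chi_square(term, docs, labels, classes):
    """Global score for a term: harmonic mean of its per-class chi-square
    values, weighted by the class priors P(c) (an assumed reading of the
    abstract's 'harmonic mean via category probability')."""
    n = len(docs)
    inv_sum = 0.0
    for c in classes:
        in_c = [d for d, y in zip(docs, labels) if y == c]
        out_c = [d for d, y in zip(docs, labels) if y != c]
        A = sum(term in d for d in in_c)
        C = len(in_c) - A
        B = sum(term in d for d in out_c)
        D = len(out_c) - B
        score = chi_square(A, B, C, D)
        p_c = len(in_c) / n
        # A zero per-class score drives the harmonic mean to zero.
        inv_sum += p_c / score if score > 0 else float("inf")
    return 0.0 if inv_sum == float("inf") else 1.0 / inv_sum
```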
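
The abstract names only the inputs to the per-class quota (a preset threshold, the class's document count, and the total document count), so the sketch below assumes the simplest combination: a class's quota is the overall budget scaled by its share of training documents. It reuses chi_square from the sketch above; category_based_selection and total_budget are hypothetical names.

```python
def category_based_selection(docs, labels, classes, vocabulary, total_budget):
    """Pick feature words per class by chi-square rank.

    Assumption: class c's quota is total_budget * N_c / N, i.e. the preset
    threshold scaled by the class's share of training documents. The final
    feature space is the union over classes, so different classes may
    contribute the same word, as the abstract notes.
    """
    n = len(docs)
    selected = set()
    for c in classes:
        in_c = [d for d, y in zip(docs, labels) if y == c]
        out_c = [d for d, y in zip(docs, labels) if y != c]
        quota = max(1, round(total_budget * len(in_c) / n))
        scored = []
        for term in vocabulary:
            A = sum(term in d for d in in_c)
            C = len(in_c) - A
            B = sum(term in d for d in out_c)
            D = len(out_c) - B
            scored.append((chi_square(A, B, C, D), term))
        scored.sort(reverse=True)  # highest chi-square first
        selected.update(term for _, term in scored[:quota])
    return selected
```

Under these assumptions, the comparison the abstract reports would vectorize documents over each resulting feature set (global vs. category-based) and train the same KNN classifier on both, attributing any difference in overall performance to the feature selection strategy.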