计算机技术与发展
COMPUTER TECHNOLOGY AND DEVELOPMENT
2010, Issue 1, pp. 9-13 (5 pages)
特征提取%文本分类%短语切分%权值调整
feature extraction%text classification%phrase segmentation%weight adjustment
Internet文本信息量极速增加,在组织和处理这些文本数据时,文本分类技术显得尤为重要.利用统计学理论,特征提取和权重计算常常忽略了特征项之间的语法关系.文中提出了一种将短语切分与文本分类相结合的新方法. 在经过TFIDF计算之后,在同一个短语中,特征项之间的关系被计算出来,然后调整权值向量,最后可以得到文本分类的正确率.同一般地文本分类方法相比,加入短语切分的文本分类方法的正确率平均提高了1.5%以上.
With the rapid growth of textual information on the Internet, text classification has become a key technology for organizing and processing large amounts of document data. Statistical methods of feature extraction and weight calculation usually ignore the syntactic relationships between feature terms. This paper proposes a new method that combines phrase segmentation with text classification. After the TFIDF weights are computed, the relationships between feature terms within the same phrase are calculated and used to adjust the weight vector, and the accuracy of text classification is then evaluated. Compared with the general text classification method, the method with phrase segmentation improves accuracy by more than 1.5% on average.
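
The pipeline described in the abstract (TFIDF weighting followed by a phrase-based weight adjustment) can be illustrated with a minimal Python sketch. The abstract does not give the adjustment formula, so `adjust_by_phrases` and its `boost` parameter are hypothetical: the rule here simply adds to each term a fraction of the mean TFIDF weight of the other terms in the same phrase. The tokenization and the phrase lists are also assumed to be supplied by an external segmenter.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute plain TF-IDF; docs is a list of token lists.

    Returns one {term: weight} dict per document.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def adjust_by_phrases(weights, phrases, boost=0.5):
    """Hypothetical adjustment rule (NOT taken from the paper).

    Each term in a phrase gains `boost` times the mean original weight
    of the other terms in that phrase. Terms outside any phrase keep
    their TF-IDF weight.
    """
    adjusted = dict(weights)
    for phrase in phrases:
        # Keep only known terms, dropping repeats but preserving order.
        terms = list(dict.fromkeys(t for t in phrase if t in weights))
        if len(terms) < 2:
            continue
        for t in terms:
            others = [weights[u] for u in terms if u != t]
            adjusted[t] += boost * sum(others) / len(others)
    return adjusted


docs = [
    ["text", "classification", "uses", "feature", "extraction"],
    ["phrase", "segmentation", "splits", "text"],
    ["feature", "weights", "are", "adjusted"],
]
base = tfidf(docs)[0]
# Assumed output of a phrase segmenter for the first document.
adjusted = adjust_by_phrases(base, [["text", "classification"]])
```

The adjusted vectors would then be passed to whatever classifier is in use; comparing accuracy with and without the adjustment step mirrors the evaluation the abstract describes.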