东南大学学报(自然科学版)
JOURNAL OF SOUTHEAST UNIVERSITY
2014
Issue 2
256-260
5 pages
袁满%欧阳元新%熊璋%罗建辉
频繁项目集%短文本分类%特征扩展
frequent term sets%short text classification%feature extension
为了解决向量空间模型(VSM)对短文本内容表示能力不足的问题,提出了一种基于频繁词集的特征扩展方法。定义了单词间的共现关系和类别同向关系,通过计算单词集的支持度和置信度,挖掘出具有相同类别倾向的频繁词集,并将其作为短文本特征扩展的背景知识库。对于短文本中的每个原始单词,从背景知识库中查找包含有该单词的频繁词集,将其作为扩展特征加入原特征向量中。搜狗语料集上的实验结果表明,置信度和支持度对背景知识库的规模有较大的影响,但是扩展过多的特征存在冗余性,对分类效果没有进一步的提升。基于频繁词集构建的短文本背景知识库可以作为有效的扩展特征;当训练文本数较为有限时,特征扩展对支持向量机SVM的分类效果有显著的提升。
A short-text feature extension method based on frequent term sets is proposed to address the limited ability of the vector space model (VSM) to represent short text content. Co-occurrence and class orientation relations between terms are defined, and frequent term sets with the same class orientation are mined by calculating the support and confidence of term sets; these sets serve as the background knowledge base for short-text feature extension. For each original term in a short text, the frequent term sets containing that term are retrieved from the background knowledge base and added to the original feature vector as extended features. Experimental results on the Sogou corpus show that support and confidence strongly affect the size of the background knowledge base, but extending with too many features introduces redundancy and yields no further improvement in classification. The background knowledge base built from frequent term sets provides effective extended features. When the number of training documents is limited, feature extension significantly improves the classification performance of the support vector machine (SVM).
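The abstract does not give the exact definitions of support, confidence, or class orientation, so the following Python sketch only illustrates the general idea under stated assumptions. It restricts term sets to pairs, takes support as the fraction of training documents containing both terms, and takes confidence as the share of those documents that belong to the pair's most common class. The function names `mine_background_knowledge` and `extend_features`, the thresholds, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations


def mine_background_knowledge(docs, labels, min_support=0.4, min_confidence=0.9):
    """Mine frequent term pairs that lean toward a single class.

    Assumptions (not from the paper): term sets are limited to pairs,
    support = share of documents containing both terms,
    confidence = share of those documents in the pair's dominant class.
    Returns a dict mapping each term to the terms it is paired with.
    """
    n_docs = len(docs)
    pair_count = Counter()
    pair_class_count = defaultdict(Counter)

    for terms, label in zip(docs, labels):
        # Count each unordered pair once per document.
        for pair in combinations(sorted(set(terms)), 2):
            pair_count[pair] += 1
            pair_class_count[pair][label] += 1

    knowledge = defaultdict(set)
    for pair, count in pair_count.items():
        support = count / n_docs
        confidence = max(pair_class_count[pair].values()) / count
        if support >= min_support and confidence >= min_confidence:
            first, second = pair
            knowledge[first].add(second)
            knowledge[second].add(first)
    return dict(knowledge)


def extend_features(terms, knowledge):
    """Append terms that co-occur with the original terms in the knowledge base."""
    extended = list(terms)
    seen = set(terms)
    for term in terms:
        for extra in sorted(knowledge.get(term, ())):
            if extra not in seen:
                seen.add(extra)
                extended.append(extra)
    return extended


if __name__ == "__main__":
    # Toy training set: each document is a list of already-segmented terms.
    train_docs = [
        ["stock", "market", "rise"],
        ["stock", "market", "fall"],
        ["match", "goal", "team"],
        ["match", "goal", "win"],
        ["stock", "goal"],
    ]
    train_labels = ["finance", "finance", "sports", "sports", "finance"]

    kb = mine_background_knowledge(train_docs, train_labels)
    print(extend_features(["stock", "price"], kb))
```

Raising `min_support` or `min_confidence` shrinks the knowledge base, which mirrors the abstract's observation that these two thresholds control its size. The extended term list would then be vectorized and passed to a classifier such as an SVM.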