计算机工程
COMPUTER ENGINEERING
2014
Issue 11
pp. 167-171
5 pages in total
杜芳华%冀俊忠%吴晨生%吴金源
文本分类%半监督学习%聚集信息素%自训练%Top-k策略%随机选择策略
text classification%semi-supervised learning%aggregation pheromone%self-training%Top-k strategy%random selection strategy
半监督文本分类中已标记数据与未标记数据分布不一致,可能导致分类器性能较低。为此,提出一种利用蚁群聚集信息素浓度的半监督文本分类算法。将聚集信息素与传统的文本相似度计算相融合,利用Top-k策略选取出未标记蚂蚁可能归属的种群,依据判断规则判定未标记蚂蚁的置信度,采用随机选择策略,把置信度高的未标记蚂蚁加入到对其最有吸引力的训练种群中。在标准数据集上与朴素贝叶斯算法和EM算法进行对比实验,结果表明,该算法在精确率、召回率以及F1度量方面都取得了更好的效果。
Many semi-supervised text categorization algorithms rely on assumptions about the data distribution. However, they may perform poorly when the labeled data and the unlabeled data follow different distributions. This paper presents a semi-supervised text classification algorithm based on aggregation pheromone, which real ants and other insects use for species aggregation. The proposed method makes no assumption about the data distribution and can be applied to any kind of data distribution. Aggregation pheromone is combined with conventional text similarity computation, and the colonies that an unlabeled ant may belong to are selected with a Top-k strategy. The confidence of each unlabeled ant is then determined by a judgment rule, and unlabeled ants with high confidence are added to the training colony that attracts them most, using a random selection strategy. Experiments on a benchmark dataset show that the algorithm outperforms Naïve Bayes and the EM algorithm in precision, recall, and Macro F1.
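The self-training loop outlined in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the pheromone formula, the judgment rule, or the details of the random selection strategy. The Gaussian-shaped pheromone deposit in `attraction`, the ratio test controlled by `margin`, the random subsampling controlled by `sample_rate`, and all function and parameter names are hypothetical choices made for this sketch.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two dense term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def attraction(ant, colony, delta=0.5):
    """Average pheromone an ant senses from a colony.

    Assumption: each member deposits a Gaussian-shaped amount of pheromone
    that decays with cosine distance (1 - cosine similarity).
    """
    total = sum(
        math.exp(-((1.0 - cosine(ant, member)) ** 2) / (2 * delta ** 2))
        for member in colony
    )
    return total / len(colony)


def self_train(colonies, unlabeled, k=2, margin=1.2,
               sample_rate=0.5, max_rounds=10, seed=0):
    """Grow labeled colonies with confident unlabeled ants.

    colonies: dict mapping label -> list of vectors (copied, not mutated).
    Returns (grown colonies, vectors that stayed unlabeled).
    """
    rng = random.Random(seed)
    colonies = {label: list(members) for label, members in colonies.items()}
    pool = list(unlabeled)
    for _ in range(max_rounds):
        if not pool:
            break
        confident = []
        for idx, ant in enumerate(pool):
            # Top-k: keep the k colonies that attract this ant most.
            ranked = sorted(
                ((attraction(ant, members), label)
                 for label, members in colonies.items()),
                key=lambda pair: pair[0],
                reverse=True,
            )[:k]
            best, label = ranked[0]
            runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
            # Assumed judgment rule: the best colony must beat the
            # runner-up by a fixed ratio for the ant to count as confident.
            if best >= margin * runner_up:
                confident.append((idx, label))
        if not confident:
            break
        # Assumed random selection: move a random subset of the confident
        # ants into their most attractive colony in each round.
        n_pick = max(1, int(len(confident) * sample_rate))
        chosen = rng.sample(confident, n_pick)
        for idx, label in chosen:
            colonies[label].append(pool[idx])
        taken = {idx for idx, _ in chosen}
        pool = [ant for i, ant in enumerate(pool) if i not in taken]
    return colonies, pool
```

Calling `self_train({"sports": [[1, 0], [0.9, 0.1]], "tech": [[0, 1], [0.1, 0.9]]}, [[0.8, 0.2], [0.2, 0.8]])` moves each unlabeled vector into the colony whose members it most resembles. Adding only a random subset per round means later ants are scored against colonies that have already grown, which is one plausible reading of the random selection strategy.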