计算机研究与发展
JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT
2010, Issue 1, pp. 81-87 (7 pages)
许震 (Xu Zhen), 沙朝锋 (Sha Chaofeng), 王晓玲 (Wang Xiaoling), 周傲英 (Zhou Aoying)
semi-supervised learning; imbalance; KL divergence; naïve Bayesian; logistic regression
In practical applications, labeled negative examples often cannot be obtained directly for various reasons, which leaves traditional classification methods unable to work; semi-supervised learning from positive and unlabeled data has therefore become a hot research topic. Researchers have proposed various solutions, but none of them handles the imbalanced classification problem effectively, especially when the hidden negative examples are very few or the instances in the training set are unevenly distributed. This paper therefore proposes LiKL, a KL divergence-based semi-supervised classification algorithm: it successively mines the most reliable positive and negative examples from the unlabeled set, and then classifies with a trained enhanced classifier. Compared with other methods, LiKL not only improves precision and recall but is also robust.
In many real applications, such as Web search, medical diagnosis, and earthquake identification, it is often difficult or expensive to obtain labeled negative examples for learning. This renders traditional classification techniques ineffective, because the precondition that every class has its own labeled instances is not met. Semi-supervised learning from positive and unlabeled data has therefore become a hot topic in the literature. Researchers have proposed many methods in recent years, but they cannot cope well with the imbalanced classification problem, especially when the number of hidden negative examples in the unlabeled set is relatively small or the examples in the training set are unevenly distributed. This paper proposes a novel KL divergence-based semi-supervised classification algorithm, named LiKL (semi-supervised learning algorithm from imbalanced data based on KL divergence), to tackle this problem. LiKL first finds likely positive examples in the unlabeled set, then finds likely negative ones, and finally uses an enhanced logistic regression classifier to classify the remaining unlabeled examples. Experiments show that, compared with previous work, the proposed approach not only improves precision and recall but is also very robust.
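The abstract only outlines the LiKL pipeline (mine likely positives, then likely negatives, from the unlabeled set, then train an enhanced logistic regression). The Python sketch below illustrates what such a KL divergence-driven positive-and-unlabeled learning pipeline could look like; it is a minimal illustration under assumptions of this rewrite, not the authors' implementation: the naive Bayes scoring step, the pos_frac/neg_frac selection thresholds, and the names pu_likl_sketch and kl_divergence are invented here for illustration, and the paper's actual KL-based ranking and enhanced classifier differ in detail.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression


def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions, clipped for numerical stability.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))


def pu_likl_sketch(X_pos, X_unl, pos_frac=0.1, neg_frac=0.3):
    # X_pos: numpy array of labeled positive examples; X_unl: numpy array of unlabeled examples.
    # Step 1: fit a rough naive Bayes model, provisionally treating every
    # unlabeled example as negative (a common PU-learning heuristic, assumed here).
    X = np.vstack([X_pos, X_unl])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_unl))])
    nb = GaussianNB().fit(X, y)

    # Step 2: rank unlabeled examples by the KL divergence between the
    # "certainly positive" distribution [P(neg)=0, P(pos)=1] and the predicted
    # class distribution; small divergence means the example is likely positive.
    proba = nb.predict_proba(X_unl)                  # columns: P(neg), P(pos)
    target = np.array([0.0, 1.0])
    div = np.array([kl_divergence(target, p) for p in proba])
    order = np.argsort(div)

    n_pos = max(1, int(pos_frac * len(X_unl)))
    n_neg = max(1, int(neg_frac * len(X_unl)))
    likely_pos = X_unl[order[:n_pos]]                # smallest divergence: likely positives
    likely_neg = X_unl[order[-n_neg:]]               # largest divergence: likely negatives

    # Step 3: train the final logistic regression on the original positives plus
    # the likely positives and negatives mined from the unlabeled set.
    X_train = np.vstack([X_pos, likely_pos, likely_neg])
    y_train = np.concatenate([np.ones(len(X_pos) + n_pos), np.zeros(n_neg)])
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)

In this sketch the selection fractions control how aggressively reliable examples are harvested from the unlabeled set; choosing them conservatively is one way to cope with the imbalance the abstract describes, since only the highest-confidence candidates enter the final training set.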