计算机应用研究
Application Research of Computers
2015
Issue 11
3324-3327, 3331
5 pages in total
张荷%李梅%张阳%蔡晓妍
软件故障检测%正例和未标注学习%不平衡数据%决策树%集成分类器
software fault prediction%PU learning%unbalanced data%decision tree%ensemble classifier
针对软件故障数据中正例样本相对较少且大量样本标注困难的现实场景,已知未标注样本中包含用于建立故障检测模型的大量有用信息,提出仅用正例和未标注数据构建分类模型对软件开发过程中的故障进行检测的半监督学习方法。首先采用合成少数类过采样 SMOTE 算法对数据集中的正例样本进行过采样,平衡数据集中的类分布。在此基础上合理构建正例集合和未标注集合,采用 POSC 4.5和 Bagging 算法构建软件故障决策树集成分类器。通过对 NASA MDP 数据库中的12个数据集进行对比实验,结果表明,仅用正例和未标注数据建模可以得到与有监督学习方法相近的软件故障检测率,且集成分类器方法比单分类器方法具有更高的检测率,未标注样本集大小的软件故障检测率同样有影响。
In software fault datasets, labeled positive (faulty) examples are often scarce and most samples are difficult to label, yet the unlabeled samples contain a great deal of information useful for building a fault detection model. This paper proposes a semi-supervised classification method that detects faults during software development using only positive and unlabeled data. The method first applies SMOTE (synthetic minority over-sampling technique) to oversample the rare positive examples and balance the class distribution. It then partitions the resulting dataset into a positive set and an unlabeled set, and uses the POSC 4.5 and Bagging algorithms to build a decision tree ensemble classifier for software fault prediction. Experiments were conducted on 12 datasets from the NASA MDP database. The results show that the fault detection rate achieved with positive and unlabeled learning is close to that of supervised learning, that the ensemble classifier achieves a higher detection rate than a single classifier, and that the size of the unlabeled set also affects the fault detection rate.
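The oversampling step named in the abstract can be illustrated with a short sketch. This is a simplified version of the general SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the authors' implementation. The function name `smote_oversample`, its parameters, and the toy two-feature vectors are assumptions made for illustration only; the POSC 4.5 and Bagging stages are not reproduced here.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic new points interpolated between minority samples.

    minority: list of equal-length numeric feature vectors (the positive class).
    k: number of nearest minority neighbours considered for each base sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Simplification: pick the base sample at random instead of
        # iterating over every minority sample a fixed number of times.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment
        # from the base sample to the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical feature vectors for four faulty modules (illustrative values).
faulty_modules = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_oversample(faulty_modules, n_synthetic=6, k=2)
balanced_positive = faulty_modules + new_points
```

Because every synthetic point lies on a segment between two existing positive samples, the enlarged positive set stays inside the region already occupied by the faulty modules, which is what allows the class distribution to be balanced without simply duplicating records.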