管理工程学报
Journal of Industrial Engineering and Engineering Management
2013
Issue 4
pp. 119-125
琚春华 邹江波 魏建良 张华
数据流 概念漂移 情景特征 前馈 动态集成分类器
data stream; concept drift; scenario characteristics; feed-forward; dynamic ensemble classifier
集成分类器已被广泛应用于数据流分类模型以此削弱概念漂移的影响.通常当基分类器的分类准确率低于特定的阈值时,集成分类器开始学习代替分类准确率低的分类器,以此来克服概念漂移的影响.但仅当基分类器的错误率低于阈值时才开始学习会使集成分类器对当前概念的判断产生一定滞后性,所以本文在集成分类器的基础上,融入了情景特征的分析,采用信息增益的方法提取情景特征,通过动态设置情景特征的阈值来提前预测概念漂移的发生.当情景特征的变化超出情景阈值时,立即通知集成分类器重新学习产生新的基分类器,而不是等到基分类器的准确率低于集成分类器的阈值时才开始学习,这样便使集成分类器具有了一定的前馈性.通过对特定数据的实验分析,证明了本文提出的OCEC (Origin Characteristics Ensemble Classifier)模型降低了挖掘概念漂移数据流时的集成泛化误差,提高了检测概念漂移的有效性.
Data mining techniques have been applied in many domains, such as retailing, the stock market, telecommunications, and medicine. The data streams generated in these industries are not stable; they change over time. Moreover, these changes are unpredictable and cause the target concepts to shift, a phenomenon generally known in the literature as concept drift. The relationships between hidden contexts and concepts are still not clear. Modeling data streams that contain concept drift is one of the core problems in data mining, because a changing target concept reduces the accuracy of the model and requires the corresponding decision model to be revised to handle the current input data. The models and algorithms in the existing literature fall into three groups: (1) instance-based selection learning methods, (2) instance-based weighting learning methods, and (3) ensemble classification learning methods (or learning with multiple concept descriptions). In ensemble methods, the base classifiers reflect the current concept, and the class of a sample is predicted by integrating all of their classification results. Ensemble classifiers have been widely used to weaken the impact of concept drift on data stream classification models. When the predictive accuracy of a base classifier falls below a given threshold, the ensemble classifier learns a new base classifier to replace the old one and thereby overcome the influence of concept drift. However, because the ensemble classifier starts to learn only after the accuracy of a base classifier has dropped below the threshold, its identification of the current concept lags behind the drift. This paper proposes a new method that adds scenario characteristics analysis to the ensemble classifier and uses information gain to extract the scenario characteristics.
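The information gain step mentioned above can be illustrated with a minimal sketch. It assumes discrete feature values and ranks features by information gain with respect to the class label; the function names (`entropy`, `information_gain`, `top_features`) and the top-k selection rule are illustrative assumptions, not the procedure used in the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of one discrete feature with respect to the labels."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after splitting on the feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_features(rows, labels, k):
    """Indices of the k features with the highest information gain.

    Hypothetical helper: choosing the top k is an assumption made for
    illustration, not a rule taken from the paper.
    """
    n_features = len(rows[0])
    gains = [information_gain([row[j] for row in rows], labels)
             for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]
```

For example, with rows `[("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]` and labels `[0, 0, 1, 1]`, the first feature determines the label completely and the second carries no information about it, so `top_features(rows, labels, 1)` selects feature 0.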
In addition, the threshold of the scenario characteristics is set dynamically to predict the occurrence of concept drift in advance. When the variation of the scenario characteristics exceeds the scenario threshold, the ensemble classifier is immediately triggered to learn a new base classifier rather than waiting until the accuracy of a base classifier falls below the given threshold, which gives the ensemble classifier a feed-forward capability. Experiments on specific data sets show that the proposed OCEC (Origin Characteristics Ensemble Classifier) model reduces the ensemble generalization error when mining data streams with concept drift and improves the effectiveness of concept drift detection.
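The feed-forward trigger described above can be sketched as follows. The abstract does not specify how the variation of a scenario characteristic is measured or how the threshold is adapted, so this sketch makes two assumptions: the variation is the total variation distance between the value distributions of a reference window and the current window, and the dynamic threshold is the mean plus `k` standard deviations of recently observed variations, bounded below by `floor`. The names `variation`, `ScenarioMonitor`, `k`, and `floor` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def variation(reference, current):
    """Total variation distance between the value distributions of two
    non-empty windows (an assumed measure, not the one from the paper)."""
    ref, cur = Counter(reference), Counter(current)
    keys = set(ref) | set(cur)
    return 0.5 * sum(abs(ref[key] / len(reference) - cur[key] / len(current))
                     for key in keys)


class ScenarioMonitor:
    """Signals a retrain when a scenario feature shifts past a dynamic threshold."""

    def __init__(self, reference, k=3.0, floor=0.1):
        self.reference = list(reference)  # window representing the current concept
        self.k = k                        # assumed sensitivity multiplier
        self.floor = floor                # assumed minimum threshold
        self.history = []                 # variations seen under the current concept

    def threshold(self):
        """Dynamic threshold: mean + k * std of past variations, at least floor."""
        if len(self.history) < 2:
            return self.floor
        return max(self.floor, mean(self.history) + self.k * pstdev(self.history))

    def update(self, window):
        """Return True if the ensemble should learn a new base classifier now."""
        v = variation(self.reference, window)
        if v > self.threshold():
            # Drift predicted: adopt the new window as the reference concept.
            self.reference = list(window)
            self.history.clear()
            return True
        self.history.append(v)
        return False
```

In use, a caller would feed each incoming window of a selected scenario feature to `update` and, whenever it returns `True`, train a new base classifier on that window without waiting for the accuracy of the existing base classifiers to degrade.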