Computer Engineering (计算机工程)
2014, Issue 1, pp. 177-180 (4 pages)
Liu Lu (刘璐)%Gao Qiang (高强)%Liu Yanheng (刘衍珩)%Sun Xin (孙鑫)
instance selection%nearest similar instance pair%k-nearest neighbor%Edited Nearest Neighbor (ENN) rule algorithm%data reduction%machine learning
Instance selection can effectively remove noise and redundant data, but existing methods struggle to improve generalization ability while also reducing the data. To address this problem, this paper proposes a new instance selection method, the Redundant Instance Pair Elimination (RIPE) algorithm. It introduces the concept of the nearest similar instance pair (a pair of nearest instances belonging to the same class), computes the nearest similar instance pairs present in a dataset, and removes the instances that satisfy the removal condition. Simulation experiments on 11 different datasets show that datasets processed by the algorithm achieve clearly better classification accuracy and storage compression ratio than the original sample sets. Compared with the Edited Nearest Neighbor rule (ENN) algorithm, RIPE maintains classification accuracy while improving the average storage compression ratio by more than 35%, fully preserves the data distribution characteristics of the original sample set, and strikes a trade-off between classification accuracy and storage compression ratio.
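The idea of pairing up nearest same-class instances and contrasting the result with ENN can be illustrated with a minimal Python sketch. It is not the paper's RIPE implementation: the abstract does not state the removal condition, so the sketch assumes that a "nearest similar instance pair" is a pair of same-class instances that are each other's nearest neighbor, and that one member of each such pair is dropped. The function names, the toy data, and the definition of storage reduction as the fraction of removed instances are likewise assumptions made for illustration. Only `enn` follows a well-known rule (drop an instance whose label disagrees with the majority of its k nearest neighbors).

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def neighbours(X, i):
    """Indices of all other instances, sorted by distance to X[i] (ties: lowest index)."""
    others = (j for j in range(len(X)) if j != i)
    return sorted(others, key=lambda j: (dist(X[i], X[j]), j))


def nearest_same_class_pairs(X, y):
    """Pairs (i, j), i < j, that share a label and are each other's nearest neighbour.

    ASSUMPTION: this mutual-nearest-neighbour reading of "nearest similar
    instance pair" is illustrative; the abstract gives no formal definition.
    """
    nn = [neighbours(X, i)[0] for i in range(len(X))]
    return [(i, nn[i]) for i in range(len(X))
            if i < nn[i] and nn[nn[i]] == i and y[i] == y[nn[i]]]


def eliminate_pairs(X, y):
    """Keep the first member of each pair and drop the second.

    ASSUMPTION: the paper's actual removal condition is not given in the
    abstract; this rule only demonstrates redundancy removal.
    """
    drop = {j for _, j in nearest_same_class_pairs(X, y)}
    return [i for i in range(len(X)) if i not in drop]


def enn(X, y, k=3):
    """Edited Nearest Neighbor: keep instances whose label matches the
    majority label of their k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in neighbours(X, i)[:k])
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return keep


def storage_reduction(n_original, n_kept):
    """Fraction of instances removed (assumed definition of storage compression)."""
    return 1 - n_kept / n_original


# Toy data: two tight same-class pairs plus one isolated, likely noisy point.
X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
y = ["a", "a", "b", "b", "a"]

kept_pairs = eliminate_pairs(X, y)   # indices kept after pair elimination
kept_enn = enn(X, y, k=1)            # indices kept by ENN with k = 1
print(kept_pairs, storage_reduction(len(X), len(kept_pairs)))
print(kept_enn, storage_reduction(len(X), len(kept_enn)))
```

On this toy data the two procedures behave differently: the pair-based rule thins out redundant instances inside each cluster, while ENN only discards the instance whose nearest neighbor has a different label. That difference is consistent with the abstract's claim that a pair-based approach can reach a higher storage compression ratio than ENN.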