JOURNAL OF NANJING UNIVERSITY(NATURAL SCIENCES)
2014, Issue 4, pp. 457-465 (9 pages)
feature selection; ensemble; min-max strategy
Feature selection is one of the key problems in machine learning and data mining. It involves identifying a subset of the most useful features that yields results comparable to those of the entire original feature set. It can reduce the dimensionality of the original data, speed up the learning process, and produce comprehensible learning models with good generalization performance. Recently, the ensemble idea has been applied to improve the performance of feature selection by integrating multiple base feature selectors into a single ensemble model; this has proved effective on high-dimensionality, small-sample-size problems, especially for robust biomarker identification. In this paper, we aim to improve the efficiency of feature selection on large-scale problems, and propose an ensemble feature selection method based on a min-max strategy. The method consists of three main steps: first, the original data is decomposed into a group of relatively small, balanced subsets according to its structure and class labels; second, a feature selection method is applied to each sub-problem, yielding a per-subset result such as a feature weight vector; third, the final result is obtained by combining the sub-problem results according to the min-max strategy. Experiments compare the min-max ensemble strategy with three other strategies, namely Mean-Weight, Voting, and K-Medoid, on classification accuracy. The G-Mean metric is chosen as the evaluation measure to account for class imbalance, and different base feature selection algorithms and data decomposition methods are used to show the effect of the proposed method. The results on five publicly available real-world datasets demonstrate that the min-max strategy is superior to the other strategies in most cases, and that ensemble feature selection using the min-max strategy can efficiently handle large-scale data.
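The three steps above can be sketched in code. This is a minimal illustration, not the paper's implementation: the base selector (absolute difference of class means) and the exact min-max combination rule (for each feature, the minimum over negative-class splits followed by the maximum over positive-class splits, by analogy with min-max modular decomposition) are assumptions made for the sketch.

```python
import numpy as np

def base_feature_weights(X_pos, X_neg):
    """Hypothetical stand-in for the base feature selector:
    score each feature by the absolute difference of class means."""
    return np.abs(X_pos.mean(axis=0) - X_neg.mean(axis=0))

def min_max_ensemble(X, y, n_splits=3):
    """Step 1: decompose each class into n_splits parts, pairing them
    into n_splits x n_splits balanced sub-problems.
    Step 2: run the base selector on every sub-problem.
    Step 3: combine the weight vectors with an assumed min-max rule:
    per feature, min over negative-class splits, max over positive ones."""
    X_pos, X_neg = X[y == 1], X[y == 0]
    pos_parts = np.array_split(X_pos, n_splits)
    neg_parts = np.array_split(X_neg, n_splits)
    # W has shape (n_splits, n_splits, n_features): one weight vector
    # per (positive part, negative part) sub-problem.
    W = np.array([[base_feature_weights(p, n) for n in neg_parts]
                  for p in pos_parts])
    return W.min(axis=1).max(axis=0)

# Toy data: feature 0 separates the classes, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 2)),
               rng.normal([3.0, 0.0], 1.0, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
weights = min_max_ensemble(X, y)
```

Because every sub-problem sees both classes, the decomposition keeps each sub-problem balanced, and the combination needs only the small per-subset weight vectors, which is what lets the scheme scale to large data.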