软件学报
JOURNAL OF SOFTWARE
2014
Issue 9
Pages 2050-2075
26 pages in total
distinguishable gene subset selection%Pearson correlation coefficient%Wilcoxon rank-sum test%K-means clustering%statistical correlation%Filter algorithms%Wrapper algorithms
Selecting a small subset of distinguishable genes that separates cancer patients from healthy individuals is difficult when a dataset contains only a few samples but tens of thousands of genes. To address this problem, this paper proposes novel hybrid gene selection algorithms based on statistical correlation and K-means clustering. First, the Pearson correlation coefficient and the Wilcoxon rank-sum test are each used to score the relevance of every gene to the class label; the least relevant genes are filtered out, and about 10 percent of the genes, those most relevant to the class label, are retained as the pre-selected gene subset. Next, K-means groups the highly correlated genes of the pre-selected subset into the same cluster, an SVM classifier is trained, and the weight of each gene is computed from the classifier's coefficients. One representative gene is then chosen from each cluster, either the gene with the largest weight or, when the roulette wheel strategy is used, the gene with the most votes, and the representatives of all clusters form the distinguishable gene subset. To verify the effectiveness of the proposed hybrid algorithms, a baseline named Random, which picks each cluster's representative gene at random, is also evaluated. The proposed algorithms are compared with Random, with Guyon's widely used gene selection algorithm SVM-RFE, and with SVM-SFS, a gene selection algorithm based on sequential forward search. Results averaged over 200 runs on several classic gene expression datasets show that the proposed algorithms select gene subsets with very good discriminative power and that classifiers built on the selected subsets achieve very high classification accuracy.
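The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes labels coded as +1/-1, uses only the Pearson score for the filter step (the Wilcoxon-based score and the roulette wheel voting variant are omitted), and swaps in a simple Pegasos-style subgradient solver without a bias term as a stand-in for the SVM training. All function names and parameters (`pearson_score`, `prefilter`, `kmeans`, `linear_svm_weights`, `select_genes`, `keep`, `k`, `lam`, `epochs`) are hypothetical.

```python
import math
import random

def pearson_score(gene, labels):
    """Absolute Pearson correlation between one gene's expression and the labels."""
    n = len(gene)
    mg, ml = sum(gene) / n, sum(labels) / n
    cov = sum((g - mg) * (l - ml) for g, l in zip(gene, labels))
    sg = math.sqrt(sum((g - mg) ** 2 for g in gene))
    sl = math.sqrt(sum((l - ml) ** 2 for l in labels))
    return abs(cov / (sg * sl)) if sg > 0 and sl > 0 else 0.0

def prefilter(X, y, keep):
    """Filter step: indices of the `keep` genes most correlated with the labels."""
    scores = [pearson_score([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:keep]

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def linear_svm_weights(X, y, lam=0.01, epochs=200, seed=0):
    """Stand-in linear SVM: Pegasos-style subgradient descent on the hinge loss.
    Assumes labels in {-1, +1}; no bias term is fitted."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1 - eta * lam) * wj for wj in w]
            if margin < 1:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def select_genes(X, y, keep, k):
    """Filter, cluster the kept genes, then keep the largest-|weight| gene per cluster."""
    pre = prefilter(X, y, keep)
    profiles = [[row[j] for row in X] for j in pre]  # one point per gene
    clusters = kmeans(profiles, k)
    weights = linear_svm_weights([[row[j] for j in pre] for row in X], y)
    best = {}  # cluster index -> position (within `pre`) of its representative
    for pos, c in enumerate(clusters):
        if c not in best or abs(weights[pos]) > abs(weights[best[c]]):
            best[c] = pos
    return sorted(pre[pos] for pos in best.values())
```

Here `X` is a list of samples, each a list of gene expression values, and `y` holds the matching labels; a call such as `select_genes(X, y, keep=len(X[0]) // 10, k=5)` mirrors the roughly 10 percent pre-selection mentioned above. Because K-means may leave a cluster empty, the sketch can return fewer than `k` genes, and since SVM weights depend on feature scale, expression values would normally be standardized first.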